URL: https://rstudio.cloud/learn/primers
GitHub: https://github.com/rstudio-education/primers
These interactive tutorials were all created using the learnr package. If you would like to learn how to create your own tutorials, visit the learnr site at https://rstudio.github.io/learnr/.
Start here to learn the skills that you will rely on in every analysis (and every primer that follows): how to inspect, visualize, subset, and transform your data, as well as how to run code.
If you’re ready to begin, go to the first tutorial. There is no need to install or download anything. Each tutorial has everything you need to write and run R code, right in the tutorial.
Start here and begin making plots with R. Plots are one of the most important tools for data science; they are also one of the most fun.
Visualization is one of the most important tools for data science.
It is also a great way to start learning R; when you visualize data, you get an immediate payoff that will keep you motivated as you learn. After all, learning a new language can be hard!
This tutorial will teach you how to visualize data with R’s most
popular visualization package, ggplot2.
The tutorial focuses on three basic skills:
In this tutorial, we will use the core tidyverse packages, including ggplot2. I’ve already loaded the packages for you, so let’s begin!
These examples are excerpted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. You can purchase the book at shop.oreilly.com.
“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey
Let’s begin with a question to explore.
What do you think: Do cars with bigger engines use more fuel than cars with smaller engines?
Great!
In other words, there is a negative relationship between engine size and fuel efficiency. Now let's test your hypothesis against data.
You can test your hypothesis with the mpg dataset that comes in the ggplot2 package. mpg contains observations collected on 38 models of cars by the US Environmental Protection Agency.
To see the mpg data frame, type mpg in the code block below and click “Submit Answer”.
# mpg is a dataset in the ggplot2 package, so we need to load ggplot2, which is part of the tidyverse.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.0
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
mpg
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # … with 224 more rows
"Good job! We'll use interactive code chunks like this throughout these tutorials.
Whenever you encounter one, you can click Submit Answer to run (or re-run) the code in the chunk.
If there is a Solution button, you can click it to see the answer."
You can use the black triangle that appears at the top right of the table to scroll through all of the columns in mpg.
Among the variables in mpg are:
displ, a car's engine size, in litres, and
hwy, a car's fuel efficiency on the highway, in miles per gallon (mpg).
Now let’s use this data to make our first graph.
The code below uses functions from the ggplot2 package to plot the relationship between displ and hwy.
To see the plot, click “Run Code.”
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))

Can you spot the relationship?
The plot shows a negative relationship between engine size (displ) and fuel efficiency (hwy). Points that have a large value of displ have a small value of hwy and vice versa.
In other words, cars with big engines use more fuel. If that was your hypothesis, you were right!
Now let’s look at how we made the plot.
Here’s the code that we used to make the plot. Notice that it contains three functions: ggplot(), geom_point(), and aes().
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
In R, a function is a name followed by a set of parentheses. Many functions require special information to do their jobs, and you write this information between the parentheses.
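As a quick illustration of this idea, here are a few calls to base R functions (the specific functions are just examples; they are not part of ggplot2):

```r
# A function call is the function's name followed by parentheses.
# Any information the function needs goes between the parentheses.
sqrt(81)                    # one unnamed argument
round(3.14159, digits = 2)  # a named argument, written name = value
Sys.time()                  # some functions need no arguments at all
```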
The first function, ggplot(), creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph.
By itself, ggplot(data = mpg) creates an empty graph, which looks like this.
ggplot(data = mpg)

geom_point() adds a layer of points to the empty plot created by ggplot(). This gives us a scatterplot.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
#### mapping = aes()
geom_point() takes a mapping argument that defines which variables in your dataset are mapped to which axes in your graph. The mapping argument is always paired with the function aes(), which you use to gather together all of the mappings that you want to create.
Here, we want to map the displ variable to the x axis and the hwy variable to the y axis, so we add x = displ and y = hwy inside of aes() (and we separate them with a comma).
Where will ggplot2 look for these mapped variables? In the data frame that we passed to the data argument, in this case, mpg.
Our code follows the common workflow for making graphs with ggplot2. To make a graph, you: pick a dataset, choose a geom function to display the data, and map the variables in the data to the aesthetics of the graph.
In fact, you can turn our code into a reusable template for making graphs. To make a graph, replace the bracketed sections in the code below with a data set, a geom_ function, or a collection of mappings.
Give it a try! Replace the bracketed sections with mpg, geom_boxplot, and x = class, y = hwy to make a slightly different graph. Be sure to delete the # symbols before you run the code.
# ggplot(data = <DATA>) +
# <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = class, y = hwy))

"Good job! This plot uses boxplots to compare the fuel efficiencies of different types of cars. ggplot2 comes with many geom functions that each add a different type of layer to a plot. You'll learn more about boxplots and other geoms in the tutorials that follow."
As you start to run R code, you’re likely to run into problems. Don’t worry — it happens to everyone. I have been writing R code for years, and every day I still write code that doesn’t work!
Start by carefully comparing the code that you’re running to the code in the examples. R is extremely picky, and a misplaced character can make all the difference. Make sure that every ( is matched with a ) and every " is paired with another ". Also pay attention to capitalization; R is case sensitive.
One common problem when creating ggplot2 graphics is to put the + in the wrong place: it has to come at the end of a line, not the start. In other words, make sure you haven’t accidentally written code like this:
ggplot(data = mpg)
+ geom_point(mapping = aes(x = displ, y = hwy))
If you’re still stuck, try the help. You can get help about any R function by running ?function_name in a code chunk, e.g. ?geom_point. Don’t worry if the help doesn’t seem that helpful — instead skip down to the bottom of the help page and look for a code example that matches what you’re trying to do.
If that doesn’t help, carefully read the error message that appears when you run your (non-working) code. Sometimes the answer will be buried there! But when you’re new to R, you might not yet know how to understand the error message. Another great tool is Google: try googling the error message, as it’s likely someone else has had the same problem, and has gotten help online.
Run ggplot(data = mpg). What do you see?
ggplot(data = mpg)

"Good job! A ggplot that has no layers looks blank. To finish the graph, add a geom function."
Make a scatterplot of cty vs hwy.
ggplot(data = mpg) +
geom_point(aes(x = cty, y = hwy))

"Excellent work!"
What happens if you make a scatterplot of class vs drv? Try it. Why is the plot not useful?
ggplot(data = mpg) +
geom_point(aes(x = class, y = drv))

"Nice job! `class` and `drv` are both categorical variables. As a result, points can only appear at certain values, where many points overlap each other. You have no idea how many points fall on top of each other at each location. Experiment with geom_count() to find a better solution."
“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey
In the plot below, one group of points (highlighted in red) seems to fall outside of the linear trend between engine size and gas mileage. These cars have a higher mileage than you might expect. How can you explain these cars?
[Figure: scatterplot of displ vs hwy with a group of outlying points highlighted in red]
Let’s hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the class value for each car. The class variable of the mpg dataset classifies cars into groups such as compact, midsize, and SUV. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and SUVs became popular). To check this, we need to add the class variable to the plot.
You can add a third variable, like class, to a two dimensional scatterplot by mapping it to a new aesthetic. An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points.
You can display a point (like the one below) in different ways by changing the values of its aesthetic properties. Since we already use the word “value” to describe data, let’s use the word “level” to describe aesthetic properties. Here we change the levels of a point’s size, shape, and color to make the point small, triangular, or blue:
[Figure: a single point drawn with different levels of its size, shape, and color aesthetics]
We can add the class variable to the plot by mapping the levels of an aesthetic (like color) to the values of class. For example, we can color a point green if it belongs to the compact class, blue if it belongs to the midsize class, and so on.
Let’s give this a try. Fill in the blank piece of code below with color = class. What happens? Delete the commenting symbols (#) before running your code. (If you prefer British English, you can use colour instead of color.)
# ggplot(data = mpg) +
# geom_point(mapping = aes(x = displ, y = hwy, ____________))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))

The colors reveal that many of the unusual points in mpg are two-seater cars. These cars don’t seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.
This isn’t the only insight we’ve gleaned; you’ve also learned how to add new aesthetics to your graph. Let’s review the process.
To map an aesthetic to a variable, set the name of the aesthetic equal to the name of the variable, and do this inside mapping = aes(). ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable. ggplot2 will also add a legend that explains which levels correspond to which values.
This insight gives us a new way to think about the mapping argument. Mappings tell ggplot2 more than which variables to put on which axes, they tell ggplot2 which variables to map to which visual properties. The x and y locations of each point are just two of the many visual properties displayed by a point.
In the above example, we mapped color to class, but we could have mapped size to class in the same way.
Change the code below to map size to class. What happens?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
## Warning: Using size for a discrete variable is not advised.

"Great Job! Now the size of a point represents its class. Did you notice the warning message? ggplot2 gives us a warning here because mapping an unordered variable (class) to an ordered aesthetic (size) is not a good idea."
You can also map class to the alpha aesthetic, which controls the transparency of the points. Try it below.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
## Warning: Using alpha for a discrete variable is not advised.

"Great Job! If you look closely, you can spot something subtle: many locations contain multiple points stacked on top of each other (alpha is additive so multiple transparent points will appear opaque)."
Let’s try one more aesthetic. This time map the class of the points to shape, then look for the SUVs. What happened?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (`geom_point()`).

"Good work! What happened to the SUVs? ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when you use the shape aesthetic. So only use it when you have fewer than seven groups."
In the code below, map cty, which is a continuous variable, to color, size, and shape. How do these aesthetics behave differently for continuous variables, like cty, vs. categorical variables, like class?
# Map cty to color
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
# Map cty to size
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
# Map cty to shape
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
# Map cty to color
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = cty))

# Map cty to size
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = cty))

# Map cty to shape
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = cty))
A continuous variable cannot be mapped to shape; R returns an error. A categorical variable like class can be:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (`geom_point()`).

"Very nice! ggplot2 treats continuous and categorical variables differently. Noteably, ggplot2 supplies a blue gradient when you map a continuous variable to color, and ggplot2 will not map continuous variables to shape."
Map class to color, size, and shape all in the same plot. Does it work?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class, size = class, shape = class))
## Warning: Using size for a discrete variable is not advised.
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (`geom_point()`).

"Very nice! ggplot2 can map the same variable to multiple aesthetics."
What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Try it.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))

"Good job! ggplot2 will map the aesthetic to the results of the expression. Here, ggplot2 mapped the color of each point to TRUE or FALSE based on whther the point's `displ` value was less than five."
What if you just want to make all of the points in your plot blue, like in the plot below?
You can do this by setting the color aesthetic outside of the aes() function, like this:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

Setting works for every aesthetic in ggplot2. If you want to manually set the aesthetic to a value in the visual space, set the aesthetic outside of aes().
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue", shape = 3, alpha = 0.5)

If you want to map the aesthetic to a variable in the data space, map the aesthetic inside aes().
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class, shape = fl, alpha = displ))

What do you think went wrong in the code below? Fix the code so it does something sensible.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

"Good job! Putting an aesthetic in the wrong location is one of the most common graphing errors. Sometimes it helps to think of legends. If you will need a legend to understand what the color/shape/etc. means, then you should probably put the aesthetic inside `aes()` --- ggplot2 will build a legend for every aesthetic mapped here. If the aesthetic has no meaning and is just... well, aesthetic, then set it outside of `aes()`."
For each aesthetic, you associate the name of the aesthetic with a variable to display, and you do this within aes().
Once you map a variable to an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend; it explains the mapping between locations and values.
You’ve experimented with the most common aesthetics for points: x, y, color, size, alpha and shape. Each geom uses its own set of aesthetics (you wouldn’t expect a line to have a shape, for example). To find out which aesthetics a geom uses, open its help page, e.g. ?geom_line.
This raises a new question that we’ve only brushed over: what is a geom?
How are these two plots similar?

Both plots contain the same x variable, the same y variable, and both describe the same data. But the plots are not identical. Each plot uses a different visual object to represent the data. In ggplot2 syntax, we say that they use different geoms.
A geom is the geometrical object that a plot uses to represent observations. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom.
As we see above, you can use different geoms to plot the same data. The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data.
To change the geom in your plot, change the geom function that you add to ggplot(). For example, take this code which makes the plot on the left (above), and change geom_point() to geom_smooth(). What do you get?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

"Good job! You get the plot on the right (above)."
ggplot2 provides over 30 geom functions that you can use to make plots, and extension packages provide even more (see https://exts.ggplot2.tidyverse.org/gallery/ for a sampling). You’ll learn how to use these geoms to explore data in the Visualize Data primer.
Until then, the best way to get a comprehensive overview of the available geoms is with the ggplot2 cheatsheet. To learn more about any single geom, look at its help page, e.g. ?geom_smooth.
What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
What does the se argument to geom_smooth() do?
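One way to explore the second question is to toggle the argument and compare plots. The se argument controls whether geom_smooth() draws the confidence-interval ribbon around the smoothed line; a sketch:

```r
library(ggplot2)

# se = TRUE (the default) draws a ribbon showing the confidence
# interval around the smooth; se = FALSE draws the line alone.
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy), se = FALSE)
```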
The ideas that you’ve learned here (geoms, aesthetics, and the implied existence of a data space and a visual space) combine to form a system known as the Grammar of Graphics.
The Grammar of Graphics provides a systematic way to build any graph, and it underlies the ggplot2 package. In fact, the first two letters of ggplot2 stand for “Grammar of Graphics”.
The best way to understand the Grammar of Graphics is to see it explained in action:
Video: https://vimeo.com/223812632
What is a package?
Throughout this tutorial, I’ve referred to ggplot2 as a package. What does that mean?
The R language is subdivided into packages, small collections of data sets and functions that all focus on a single task. The functions that we used in this tutorial come from one of those packages, the ggplot2 package, which focuses on visualizing data.
When you first install R, you get a small collection of core packages known as base R. The remaining packages—there are over 10,000 of them—are optional. You don’t need to install them unless you want to use them.
ggplot2 is one of these optional packages, as are the other packages that we will look at in these tutorials. Some of the most popular and most modern parts of R come in the optional packages.
You don’t need to worry about installing packages in these tutorials. Each tutorial comes with all of the packages that you need pre-installed; this is how we make the tutorials easy to use.
However, one day, you may want to use R outside of these tutorials. When that day comes, you’ll want to remember which packages to download to acquire the functions you use here. Throughout the tutorials, I will try to make it as clear as possible where each function comes from!
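For reference, the usual pattern outside these tutorials is: install a package once with install.packages(), then load it at the start of each new R session with library(). A minimal sketch using ggplot2:

```r
# Install once (downloads the package from CRAN); commented out here
# because installation only needs to happen one time per machine.
# install.packages("ggplot2")

# Load the package in each session to make its functions available.
library(ggplot2)
```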
If you’d like to learn more about installing R packages (or R or the RStudio IDE), the Set Up video tutorial walks you through the process of setting up R on your own computer.
Congratulations! You can use the ggplot2 code template to plot any dataset in many different ways. As you begin exploring data, you should incorporate these tools into your workflow.
There is much more to ggplot2 and Data Visualization than we have covered here. If you would like to learn more about visualizing data with ggplot2, check out RStudio’s primer on Data Visualization.
Your new data visualization skills will make it easier to learn other parts of R, because you can now visualize the results of any change that you make to data. You’ll put these skills to immediate use in the next tutorial, which will show you how to extract values from datasets, as well as how to compute new variables and summary statistics from your data. See you there.
This tutorial demystifies programming with R. Here, you’ll learn how to run functions and build objects.
R is easiest to use when you know how the R language works. This tutorial will teach you the implicit background knowledge that informs every piece of R code. You’ll learn about:
Video: https://vimeo.com/220490105
Can you use the sqrt() function in the chunk below to compute the square root of 962?
sqrt(962)
## [1] 31.01612
Use the code chunk below to examine the code that sqrt() runs.
sqrt
## function (x) .Primitive("sqrt")
"Good job! sqrt immediately triggers a low level algorithm optimized for performance, so there is not much code to see."
Compare the code in sqrt() to the code in another R function, lm(). Examine lm()’s code body in the chunk below.
lm
## function (formula, data, subset, weights, na.action, method = "qr",
## model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
## contrasts = NULL, offset, ...)
## {
## ret.x <- x
## ret.y <- y
## cl <- match.call()
## mf <- match.call(expand.dots = FALSE)
## m <- match(c("formula", "data", "subset", "weights", "na.action",
## "offset"), names(mf), 0L)
## mf <- mf[c(1L, m)]
## mf$drop.unused.levels <- TRUE
## mf[[1L]] <- quote(stats::model.frame)
## mf <- eval(mf, parent.frame())
## if (method == "model.frame")
## return(mf)
## else if (method != "qr")
## warning(gettextf("method = '%s' is not supported. Using 'qr'",
## method), domain = NA)
## mt <- attr(mf, "terms")
## y <- model.response(mf, "numeric")
## w <- as.vector(model.weights(mf))
## if (!is.null(w) && !is.numeric(w))
## stop("'weights' must be a numeric vector")
## offset <- model.offset(mf)
## mlm <- is.matrix(y)
## ny <- if (mlm)
## nrow(y)
## else length(y)
## if (!is.null(offset)) {
## if (!mlm)
## offset <- as.vector(offset)
## if (NROW(offset) != ny)
## stop(gettextf("number of offsets is %d, should equal %d (number of observations)",
## NROW(offset), ny), domain = NA)
## }
## if (is.empty.model(mt)) {
## x <- NULL
## z <- list(coefficients = if (mlm) matrix(NA_real_, 0,
## ncol(y)) else numeric(), residuals = y, fitted.values = 0 *
## y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w !=
## 0) else ny)
## if (!is.null(offset)) {
## z$fitted.values <- offset
## z$residuals <- y - offset
## }
## }
## else {
## x <- model.matrix(mt, mf, contrasts)
## z <- if (is.null(w))
## lm.fit(x, y, offset = offset, singular.ok = singular.ok,
## ...)
## else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok,
## ...)
## }
## class(z) <- c(if (mlm) "mlm", "lm")
## z$na.action <- attr(mf, "na.action")
## z$offset <- offset
## z$contrasts <- attr(x, "contrasts")
## z$xlevels <- .getXlevels(mt, mf)
## z$call <- cl
## z$terms <- mt
## if (model)
## z$model <- mf
## if (ret.x)
## z$x <- x
## if (ret.y)
## z$y <- y
## if (!qr)
## z$qr <- NULL
## z
## }
## <bytecode: 0x1301bcc28>
## <environment: namespace:stats>
Wow! lm() runs a lot of code. What does it do? Open the help page for lm() in the chunk below and find out.
?lm
lm {stats} R Documentation
Fitting Linear Models
Description
lm is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (although aov may provide a more convenient interface for these).
Usage
lm(formula, data, subset, weights, na.action,
method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
singular.ok = TRUE, contrasts = NULL, offset, ...)
Arguments
formula
an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’.
data
an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called.
subset
an optional vector specifying a subset of observations to be used in the fitting process.
weights
an optional vector of weights to be used in the fitting process. Should be NULL or a numeric vector. If non-NULL, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used. See also ‘Details’,
na.action
a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The ‘factory-fresh’ default is na.omit. Another possible value is NULL, no action. Value na.exclude can be useful.
method
the method to be used; for fitting, currently only method = "qr" is supported; method = "model.frame" returns the model frame (the same as with model = TRUE, see below).
model, x, y, qr
logicals. If TRUE the corresponding components of the fit (the model frame, the model matrix, the response, the QR decomposition) are returned.
singular.ok
logical. If FALSE (the default in S but not in R) a singular fit is an error.
contrasts
an optional list. See the contrasts.arg of model.matrix.default.
offset
this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be NULL or a numeric vector or matrix of extents matching those of the response. One or more offset terms can be included in the formula instead or as well, and if more than one are specified their sum is used. See model.offset.
...
additional arguments to be passed to the low level regression fitting functions (see below).
Details
Models for lm are specified symbolically. A typical model has the form response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response. A terms specification of the form first + second indicates all the terms in first together with all the terms in second with duplicates removed. A specification of the form first:second indicates the set of terms obtained by taking the interactions of all terms in first with all terms in second. The specification first*second indicates the cross of first and second. This is the same as first + second + first:second.
If the formula includes an offset, this is evaluated and subtracted from the response.
If response is a matrix a linear model is fitted separately by least-squares to each column of the matrix.
See model.matrix for some further details. The terms in the formula will be re-ordered so that main effects come first, followed by the interactions, all second-order, all third-order and so on: to avoid this pass a terms object as the formula (see aov and demo(glm.vr) for an example).
A formula has an implied intercept term. To remove this use either y ~ x - 1 or y ~ 0 + x. See formula for more details of allowed formulae.
Non-NULL weights can be used to indicate that different observations have different variances (with the values in weights being inversely proportional to the variances); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations (including the case that there are w_i observations equal to y_i and the data have been summarized). However, in the latter case, notice that within-group variation is not used. Therefore, the sigma estimate and residual degrees of freedom may be suboptimal; in the case of replication weights, even wrong. Hence, standard errors and analysis of variance tables should be treated with care.
lm calls the lower level functions lm.fit, etc, see below, for the actual numerical computations. For programming only, you may consider doing likewise.
All of weights, subset and offset are evaluated in the same way as variables in formula, that is first in data and then in the environment of formula.
Value
lm returns an object of class "lm" or for multiple responses of class c("mlm", "lm").
The functions summary and anova are used to obtain and print a summary and analysis of variance table of the results. The generic accessor functions coefficients, effects, fitted.values and residuals extract various useful features of the value returned by lm.
An object of class "lm" is a list containing at least the following components:
coefficients
a named vector of coefficients
residuals
the residuals, that is response minus fitted values.
fitted.values
the fitted mean values.
rank
the numeric rank of the fitted linear model.
weights
(only for weighted fits) the specified weights.
df.residual
the residual degrees of freedom.
call
the matched call.
terms
the terms object used.
contrasts
(only where relevant) the contrasts used.
xlevels
(only where relevant) a record of the levels of the factors used in fitting.
offset
the offset used (missing if none were used).
y
if requested, the response used.
x
if requested, the model matrix used.
model
if requested (the default), the model frame used.
na.action
(where relevant) information returned by model.frame on the special handling of NAs.
In addition, non-null fits will have components assign, effects and (unless not requested) qr relating to the linear fit, for use by extractor functions such as summary and effects.
Using time series
Considerable care is needed when using lm with time series.
Unless na.action = NULL, the time series attributes are stripped from the variables before the regression is done. (This is necessary as omitting NAs would invalidate the time series attributes, and if NAs are omitted in the middle of the series the result would no longer be a regular time series.)
Even if the time series attributes are retained, they are not used to line up series, so that the time shift of a lagged or differenced regressor would be ignored. It is good practice to prepare a data argument by ts.intersect(..., dframe = TRUE), then apply a suitable na.action to that data frame and call lm with na.action = NULL so that residuals and fitted values are time series.
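The recommended workflow can be sketched as below; the series here is simulated and the variable names are illustrative.

```r
## Hedged sketch: align a series with its own lag before calling lm()
y   <- ts(cumsum(rnorm(100)))                            # simulated series
dat <- ts.intersect(y = y, y_lag = lag(y, -1), dframe = TRUE)
dat <- na.omit(dat)                                      # apply the na.action to the data frame
fit <- lm(y ~ y_lag, data = dat, na.action = NULL)       # residuals keep ts attributes
```

Because `ts.intersect()` has already lined up the two series (dropping the one time point where they do not overlap), `na.action = NULL` is safe and the fitted values remain a regular time series.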
Note
Offsets specified by offset will not be included in predictions by predict.lm, whereas those specified by an offset term in the formula will be.
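A small made-up example makes the difference concrete: the two fits below have identical coefficients, but only the formula offset is carried into predictions.

```r
## Hedged sketch: formula offsets survive predict.lm(); the offset argument does not
df <- data.frame(y = c(2, 4, 6, 8), x = 1:4, off = rep(1, 4))
f1 <- lm(y ~ x + offset(off), data = df)   # offset term in the formula
f2 <- lm(y ~ x, data = df, offset = off)   # offset argument
predict(f1, newdata = df) - predict(f2, newdata = df)   # differs by the offset, 1
```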
Author(s)
The design was inspired by the S function of the same name described in Chambers (1992). The implementation of model formula by Ross Ihaka was based on Wilkinson & Rogers (1973).
References
Chambers, J. M. (1992) Linear models. Chapter 4 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.
Wilkinson, G. N. and Rogers, C. E. (1973). Symbolic descriptions of factorial models for analysis of variance. Applied Statistics, 22, 392–399. doi: 10.2307/2346786.
See Also
summary.lm for summaries and anova.lm for the ANOVA table; aov for a different interface.
The generic functions coef, effects, residuals, fitted, vcov.
predict.lm (via predict) for prediction, including confidence and prediction intervals; confint for confidence intervals of parameters.
lm.influence for regression diagnostics, and glm for generalized linear models.
The underlying low level functions, lm.fit for plain, and lm.wfit for weighted regression fitting.
More lm() examples are available e.g., in anscombe, attitude, freeny, LifeCycleSavings, longley, stackloss, swiss.
biglm in package biglm for an alternative way to fit linear models to large datasets (especially those with many cases).
Examples
require(graphics)
## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
lm.D90 <- lm(weight ~ group - 1) # omitting intercept
anova(lm.D9)
summary(lm.D90)
opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
plot(lm.D9, las = 1) # Residuals, Fitted, ...
par(opar)
### less simple examples in "See Also" above
"Good job! `lm()` is R's function for fitting basic linear models. No wonder it runs so much code."
What do you think the chunk below will return? Run it and see. The result should be nothing. R will not run anything on a line after a # symbol. This is useful because it lets you write human readable comments in your code: just place the comments after a #. Now delete the # and re-run the chunk. You should see a result.
# sqrt(961)
sqrt(961)
## [1] 31
Video: https://vimeo.com/220490157
rnorm() is a function that generates random variables from a normal distribution. Find the arguments of rnorm().
args(rnorm)
## function (n, mean = 0, sd = 1)
## NULL
"Good job! `n` specifies the number of random normal variables to generate. `mean` and `sd` describe the distribution to generate the random values with."
Which arguments of rnorm() are optional?
n is not an optional argument because it does not have a default value.
Use rnorm() to generate 100 random normal values with a mean of 100 and a standard deviation of 15.
rnorm(100, mean = 100, sd = 15)
## [1] 106.77717 96.64706 94.27992 96.79427 83.51553 110.83675 84.00723
## [8] 109.56798 96.76815 88.90998 90.21935 107.48061 127.74486 122.25443
## [15] 101.46010 103.76057 92.56121 124.89257 91.54704 111.86213 100.27495
## [22] 126.79548 77.56887 81.47184 96.89119 95.15990 120.12624 115.44153
## [29] 95.59624 72.88438 92.43411 125.50980 97.08877 105.41194 90.02012
## [36] 115.76164 96.75034 97.22880 132.41666 96.00209 102.58404 83.19374
## [43] 103.83508 102.02852 102.47443 88.20627 115.23612 98.91257 97.92002
## [50] 107.63582 95.99832 89.69415 105.59018 97.78842 97.71702 95.59797
## [57] 75.44058 105.78448 96.04326 107.51356 97.96445 95.35418 131.77878
## [64] 100.80586 106.23207 109.35130 104.02730 75.68339 109.57379 92.17855
## [71] 95.37452 85.55911 134.30015 132.33867 96.90743 112.53948 99.97960
## [78] 86.59587 84.53053 97.75938 68.35060 89.29400 111.79914 120.81460
## [85] 130.65068 114.66434 105.76461 93.31965 113.09996 81.32628 106.24608
## [92] 92.41039 84.21390 125.70457 94.96537 95.86193 125.58928 69.98398
## [99] 101.56256 107.50966
Can you spot the error in the code below? Fix the code and then re-run it.
rnorm(100, mu = 100, sd = 50)
rnorm(100, mean = 100, sd = 50)
## [1] 23.047135 194.939609 34.226652 168.869725 94.919515 104.062345
## [7] 126.053860 152.530388 157.177136 72.605948 83.767693 139.228550
## [13] 104.157393 67.346849 101.156752 133.313704 12.633888 151.727114
## [19] 35.080397 88.844966 123.158523 32.803794 70.592004 30.095346
## [25] 116.775865 115.686451 114.683681 96.252669 108.933987 83.146571
## [31] 157.915458 47.543798 47.479208 42.485658 139.542386 17.188037
## [37] 120.415898 126.567071 81.390605 68.205015 32.445949 50.481090
## [43] 181.586240 66.049696 160.215740 37.961944 62.128938 144.913464
## [49] 124.586252 108.620969 122.775460 88.737818 70.901842 166.739024
## [55] 86.633994 129.586291 78.467321 77.709106 130.581754 28.166392
## [61] 53.946862 78.295532 218.691506 159.565607 58.587636 133.599970
## [67] 60.483692 27.036205 45.483835 183.997753 225.351244 156.565826
## [73] 87.168412 151.186104 69.423703 84.847331 110.217830 103.992208
## [79] 165.912385 42.901253 132.188116 73.574976 75.011586 50.489967
## [85] 84.461263 129.312759 147.992591 163.795431 20.237342 87.143155
## [91] -45.531370 90.327897 44.753952 46.746033 44.468302 132.605460
## [97] -7.850684 197.588639 130.041100 125.700315
Video: https://vimeo.com/220493412
You can choose almost any name you like for an object, as long as the name does not begin with a number or a special character like +, -, *, /, ^, !, @, or &.
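For example (the names below are arbitrary):

```r
results_1 <- rnorm(3)   # valid: begins with a letter
# 1results <- rnorm(3)  # invalid: begins with a number (uncomment to see the error)
```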
Which of these would be valid object names?
Remember that the most helpful names will remind you what you put in your object.
In the code chunk below, save the results of rnorm(100, mean = 100, sd = 15) to an object named data. Then, on a new line, call the hist() function on data to plot a histogram of the random values.
data <- rnorm(100, mean = 100, sd = 15)
hist(data)

What do you think would happen if you assigned data to a new object named copy, like this? Run the code and then inspect both data and copy.
data <- rnorm(100, mean = 100, sd = 15)
copy <- data
identical(copy, data)
## [1] TRUE
"Good job! R saves a copy of the contents in data to copy."
Objects provide an easy way to store data sets in R. In fact, R comes with many toy data sets pre-loaded. Examine the contents of iris to see a classic toy data set. Hint: how could you learn more about the iris object?
iris
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # … with 140 more rows
"Good job! You can learn more about iris by examining its help page with `?iris`."
What if you accidentally overwrite an object? If that object came with R or one of its packages, you can restore the original version of the object by removing your version with rm(). Run rm() on iris below to restore the iris data set.
iris <- 1
iris
## [1] 1
rm(iris)
iris
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # … with 140 more rows
"Good job! Unfortunately, `rm()` cannot help you if you overwrite one of your own objects."
Video: https://vimeo.com/220490316
In the chunk below, create a vector that contains the integers from one to ten.
c(1,2,3,4,5,6,7,8,9,10)
## [1] 1 2 3 4 5 6 7 8 9 10
If your vector contains a sequence of contiguous integers, you can create it with the : shortcut. Run 1:10 in the chunk below. What do you get? What do you suppose 1:20 would return?
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
You can extract any element of a vector by placing a pair of brackets behind the vector. Inside the brackets place the number of the element that you’d like to extract. For example, vec[3] would return the third element of the vector named vec.
Use the chunk below to extract the fourth element of vec.
vec <- c(1, 2, 4, 8, 16)
vec[4]
## [1] 8
You can also use [] to extract multiple elements of a vector. Place the vector c(1,2,5) between the brackets below. What does R return?
vec <- c(1, 2, 4, 8, 16)
vec[]
vec <- c(1, 2, 4, 8, 16)
vec[c(1,2,5)]
## [1] 1 2 16
If the elements of your vector have names, you can extract them by name. To do so place a name or vector of names in the brackets behind a vector. Surround each name with quotation marks, e.g. vec2[c("alpha", "beta")].
Extract the element named gamma from the vector below.
vec2 <- c(alpha = 1, beta = 2, gamma = 3)
vec2 <- c(alpha = 1, beta = 2, gamma = 3)
vec2["gamma"]
## gamma
## 3
Predict what the code below will return. Then look at the result.
c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) + c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
## [1] 2 4 6 8 10 12 14 16 18 20
"Good job! Like many R functions, R's math operators are vectorised: they're designed to work with vectors by repeating the operation for each pair of elements."
Predict what the code below will return. Then look at the result.
1 + c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
## [1] 2 3 4 5 6 7 8 9 10 11
"Good job! Whenever you try to work with vectors of varying lengths (recall that `1` is a vector of length one), R will repeat the shorter vector as needed to compute the result."
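Recycling also works with shorter vectors that are longer than one element, e.g.:

```r
c(1, 2) + c(10, 20, 30, 40)   # 1 and 2 are recycled: 11 22 31 42
```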
Video: https://vimeo.com/220490241
Which of these is not an atomic data type?
What type of data is "1L"?
Create a vector of integers from one to five. Can you imagine why you might want to use integers instead of numbers/doubles?
c(1L, 2L, 3L, 4L, 5L)
## [1] 1 2 3 4 5
Computers must use a finite amount of memory to store decimal numbers (which can sometimes require infinite precision). As a result, some decimals can only be saved as very precise approximations. From time to time you’ll notice side effects of this imprecision, like below.
Compute the square root of two, square the answer (e.g. multiply the square root of two by the square root of two), and then subtract two from the result. What answer do you expect? What answer do you get?
sqrt(2)^2 - 2
## [1] 4.440892e-16
How many types of data can you put into a single vector?
One of the most common mistakes in R is to call an object when you mean to call a character string and vice versa.
Which of these are object names? What is the difference between object names and character strings?
Character strings are surrounded by quotation marks, object names are not.
Video: https://vimeo.com/220490360
Which data structure(s) could you use to store these pieces of data in the same object? 1001, TRUE, "stories".
Make a list that contains the elements 1001, TRUE, and "stories". Give each element a name.
list(num = 1001, logic = TRUE, char = "stories")
## $num
## [1] 1001
##
## $logic
## [1] TRUE
##
## $char
## [1] "stories"
Extract the number 1001 from the list below.
things <- list(number = 1001, logical = TRUE, string = "stories")
things <- list(number = 1001, logical = TRUE, string = "stories")
things$number
## [1] 1001
You can make a data frame with the data.frame() function, which works similarly to c() and list(). Assemble the vectors below into a data frame with the column names numbers, logicals, strings.
nums <- c(1, 2, 3, 4)
logs <- c(TRUE, TRUE, FALSE, TRUE)
strs <- c("apple", "banana", "carrot", "duck")
nums <- c(1, 2, 3, 4)
logs <- c(TRUE, TRUE, FALSE, TRUE)
strs <- c("apple", "banana", "carrot", "duck")
data.frame(numbers = nums, logicals = logs, strings = strs)
## # A tibble: 4 × 3
## numbers logicals strings
## <dbl> <lgl> <chr>
## 1 1 TRUE apple
## 2 2 TRUE banana
## 3 3 FALSE carrot
## 4 4 TRUE duck
"Good Job. When you make a data frame, you must follow one rule: each column vector should be the same length."
Given that a data frame is a type of list (with named elements), how could you extract the strings column of the df data frame below? Do it.
nums <- c(1, 2, 3, 4)
logs <- c(TRUE, TRUE, FALSE, TRUE)
strs <- c("apple", "banana", "carrot", "duck")
df <- data.frame(numbers = nums, logicals = logs, strings = strs)
nums <- c(1, 2, 3, 4)
logs <- c(TRUE, TRUE, FALSE, TRUE)
strs <- c("apple", "banana", "carrot", "duck")
df <- data.frame(numbers = nums, logicals = logs, strings = strs)
df$strings
## [1] "apple" "banana" "carrot" "duck"
Video: https://vimeo.com/220490447
What does this common error message suggest? object _____ not found.
In the code chunk below, load the tidyverse package. Whenever you load a package R will also load all of the packages that the first package depends on. tidyverse takes advantage of this to create a shortcut for loading several common packages at once. Whenever you load tidyverse, tidyverse also loads ggplot2, dplyr, tibble, tidyr, readr, and purrr.
library(tidyverse)
"Good job! R will keep the packages loaded until you close your R session. When you re-open R, you'll need to reload your packages."
Did you know that library() is a special function in R? You can pass library() a package name in quotes, like library("tidyverse"), or not in quotes, like library(tidyverse): both will work! That's often not the case with R functions.
In general, you should always use quotes unless you are writing the name of something that is already loaded into R’s memory, like a function, vector, or data frame.
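A quick illustration of the difference:

```r
x <- c(1, 2, 3)
mean(x)     # x (no quotes) refers to the object in memory
# mean("x") # "x" is just a one-character string; mean() warns and returns NA
```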
But what if the package that you want to load is not installed on your computer? How would you install the dplyr package on your own computer?
install.packages("dplyr")
Congratulations. You now have a formal sense for how the basics of R work. Although you may think of yourself as a Data Scientist, this brief Computer Science background will help you as you analyze data. Whenever R does something unexpected, you can apply your knowledge of how R works to figure out what went wrong.
Learn the most important data handling skills in R: how to extract values from a table, subset tables, calculate summary statistics, and derive new variables.
If you’re ready to begin, go to the first tutorial. There is no need to install or download anything. Each tutorial has everything you need to write and run R code, right in the tutorial.
Learn to use tibbles, the most user-friendly tabular data structure in R, as well as how to manage tidyverse packages with… the tidyverse package.
In this primer, you will explore the popularity of different names over time. To succeed, you will need to master some common tools for manipulating data with R:
These are some of the most useful R functions for data science, and the tutorials that follow will provide you everything you need to learn them.
In the tutorials, we’ll use a dataset named babynames, which comes in a package that is also named babynames. Within babynames, you will find information about almost every name given to children in the United States since 1880.
This tutorial introduces babynames as well as a new data structure that makes working with data in R easy: the tibble.
In addition to babynames, this tutorial uses the core tidyverse packages, including ggplot2, tibble, and dplyr. All of these packages have been pre-installed for your convenience. But they haven’t been pre-loaded—something you will soon learn more about!
Click the Next Topic button to begin.
Package
Before we begin, let’s learn a little about our data. The babynames dataset comes in the babynames package. The package is pre-installed for you, just as ggplot2 was pre-installed in the last tutorial. But unlike in the last tutorial, I have not pre-loaded babynames, or any other package.
What does this mean? In R, whenever you want to use a package that is not part of base R, you need to load the package with the command library(). Until you load a package, R will not be able to find the datasets and functions contained in the package. For example, if we asked R to display the babynames dataset, which comes in the babynames package, right now, we’d get the message below. R cannot find the dataset because we haven’t loaded the babynames package.
## Error in eval(expr, envir, enclos): object 'babynames' not found
To load the babynames package, you would run the command library(babynames). After you load a package, R will be able to find its contents until you close R. The next time you open R, you will need to reload the package if you wish to use it again.
This might sound like an inconvenience, but choosing which packages to load keeps your R experience simple and orderly.
In the chunk below, load babynames (the package) and then open the help page for babynames (the data set). Be sure to read the help page before going on.
library(babynames)
Now that you know a little about the dataset, let’s examine its contents. If you were to run babynames at your R console, you would get output that looks like this:
babynames
## # A tibble: 1,924,665 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
## 7 1880 F Ida 1472 0.0151
## 8 1880 F Alice 1414 0.0145
## 9 1880 F Bertha 1320 0.0135
## 10 1880 F Sarah 1288 0.0132
## # … with 1,924,655 more rows
Yikes. What is happening?
babynames is a large data frame, and R is not well equipped to display the contents of large data frames. R shows as many rows as possible before your memory buffer is overwhelmed. At that point, R stops, leaving you to look at an arbitrary section of your data.
You can avoid this behaviour by transforming your data frame to a tibble.
A tibble is a special type of table. R displays tibbles in a refined way whenever you have the tibble package loaded: R will print only the first ten rows of a tibble as well as all of the columns that fit into your console window. R also adds useful summary information about the tibble, such as the data types of each column and the size of the data set.
Whenever you do not have the tibble package loaded, R will display the tibble as if it were a plain data frame. In fact, tibbles are data frames: an enhanced type of data frame.
You can think of the difference between the data frame display and the tibble display like this:
as_tibble()
You can transform a data frame to a tibble with the as_tibble() function in the tibble package, e.g. as_tibble(cars). However, babynames is already a tibble. To display it nicely, you just need to load the tibble package.
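As a quick sketch with `cars`, a plain data frame that comes with R:

```r
library(tibble)
cars2 <- as_tibble(cars)   # convert a plain data frame to a tibble
is_tibble(cars2)           # TRUE
cars2                      # prints only the first ten of its 50 rows
```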
To see what I mean, use library() to load the tibble package in the chunk below and then call babynames.
library(tibble)
babynames
## # A tibble: 1,924,665 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
## 7 1880 F Ida 1472 0.0151
## 8 1880 F Alice 1414 0.0145
## 9 1880 F Bertha 1320 0.0135
## 10 1880 F Sarah 1288 0.0132
## # … with 1,924,655 more rows
"Excellent! If you want to check whether or not an object is a tibble, you can use the `is_tibble()` function that comes in the tibble package. For example, this would return TRUE: `is_tibble(babynames)`."
You do not need to worry much about tibbles in these tutorials; in future tutorials, I’ll automatically convert each data frame into an interactive table. However, you should consider making tibbles an important part of your work in R.
What if you’d like to inspect the remaining portions of a tibble? To see the entire tibble, use the View() command. R will launch a window that shows a scrollable display of the entire data set. For example, the code below will launch a data viewer in the RStudio IDE.
View(babynames)
View() works in conjunction with the software that you run R from: View() opens the data editor provided by that software. Unfortunately, this tutorial doesn’t come with a data editor, so you won’t be able to use View() today (unless you open the RStudio IDE, for example, and run the code there).
The tibble package is one of several packages that are known collectively as “the tidyverse”. Tidyverse packages share a common philosophy and are designed to work well together. For example, in this tutorial you will use the tibble package, the ggplot2 package, and the dplyr package, all of which belong to the tidyverse.
When you use tidyverse packages, you can make your life easier by using the tidyverse package. The tidyverse package provides a shortcut for installing and loading the entire suite of packages in “the tidyverse”, e.g.
Think of the tidyverse package as a placeholder for the packages that are in the "tidyverse". By itself, tidyverse does not do much, but when you install the tidyverse package it instructs R to install every other package in the tidyverse at the same time. In other words, when you run install.packages("tidyverse"), R installs the following packages for you in one simple step:
When you load tidyverse with library("tidyverse"), it instructs R to load the most commonly used tidyverse packages. These are:
You can load the less commonly used tidyverse packages in the normal way, by running library(<package name>).
Let’s give this a try. We will use the ggplot2 and dplyr packages later in this tutorial. Let’s use the tidyverse package to load them in the chunk below:
library(tidyverse)
Which package is not loaded by library("tidyverse")?
Correct!
Now that you are familiar with the data set, and have loaded the necessary packages, let's explore the data.
Tibbles and the tidyverse package are two tools that make life with R easier. Ironically, you may not come to appreciate their value right away: these tutorials pre-load packages for you, and they wrap data frames into an interactive table for display (at least the tutorials in the primers that follow will). However, you will want to utilize tibbles and the tidyverse package when you move out of the tutorials and begin doing your own work with R inside of the RStudio IDE.
This tutorial also introduced the babynames dataset. In the next tutorial, you will use this data set to plot the popularity of your name over time. Along the way, you will learn how to filter and subset data sets in R.
Master three simple functions for finding and extracting the data in your data set. Here you will learn to select variables, filter observations, and arrange values. Here, you will also meet R's pipe operator, %>%.
In this case study, you will explore the popularity of your own name over time. Along the way, you will master some of the most useful functions for isolating variables, cases, and values within a data frame:
This tutorial uses the core tidyverse packages, including ggplot2, tibble, and dplyr, as well as the babynames package. All of these packages have been pre-installed and pre-loaded for your convenience.
Click the Next Topic button to begin.
The history of your name
You can use the data in babynames to make graphs like this, which reveal the history of a name, perhaps your name.
But before you do, you will need to trim down babynames. At the moment, there are more rows in babynames than you need to build your plot.
To see what I mean, consider how I made the plot above: I began with the entire data set, which if plotted as a scatterplot would’ve looked like this.
I then narrowed the data to just the rows that contain my name, before plotting the data with a line geom. Here's how the rows with just my name look as a scatterplot.
If I had skipped this step, my line graph would've connected all of the points in the large data set, creating an uninformative graph.
Your goal in this section is to repeat this process for your own name (or a name that you choose). Along the way, you will learn a set of functions that isolate information within a data set.
This type of task occurs often in Data Science: you need to extract data from a table before you can use it. You can do this task quickly with three functions that come in the dplyr package:
Each function takes a data frame or tibble as its first argument and returns a new data frame or tibble as its output.
select() extracts columns of a data frame and returns the columns as a new data frame. To use select(), pass it the name of a data frame to extract columns from, and then the names of the columns to extract. The column names do not need to appear in quotation marks or be prefixed with a $; select() knows to find them in the data frame that you supply.
Use the example below to get a feel for select(). Can you extract just the name column? How about the name and year columns? How about all of the columns except prop?
select(babynames, name, sex)
## # A tibble: 1,924,665 × 2
## name sex
## <chr> <chr>
## 1 Mary F
## 2 Anna F
## 3 Emma F
## 4 Elizabeth F
## 5 Minnie F
## 6 Margaret F
## 7 Ida F
## 8 Alice F
## 9 Bertha F
## 10 Sarah F
## # … with 1,924,655 more rows
# Can you extract just the name column?
select(babynames, name)
## # A tibble: 1,924,665 × 1
## name
## <chr>
## 1 Mary
## 2 Anna
## 3 Emma
## 4 Elizabeth
## 5 Minnie
## 6 Margaret
## 7 Ida
## 8 Alice
## 9 Bertha
## 10 Sarah
## # … with 1,924,655 more rows
# How about the name and year columns?
select(babynames, name, year)
## # A tibble: 1,924,665 × 2
## name year
## <chr> <dbl>
## 1 Mary 1880
## 2 Anna 1880
## 3 Emma 1880
## 4 Elizabeth 1880
## 5 Minnie 1880
## 6 Margaret 1880
## 7 Ida 1880
## 8 Alice 1880
## 9 Bertha 1880
## 10 Sarah 1880
## # … with 1,924,655 more rows
# How about all of the columns except prop?
select(babynames, -prop)
## # A tibble: 1,924,665 × 4
## year sex name n
## <dbl> <chr> <chr> <int>
## 1 1880 F Mary 7065
## 2 1880 F Anna 2604
## 3 1880 F Emma 2003
## 4 1880 F Elizabeth 1939
## 5 1880 F Minnie 1746
## 6 1880 F Margaret 1578
## 7 1880 F Ida 1472
## 8 1880 F Alice 1414
## 9 1880 F Bertha 1320
## 10 1880 F Sarah 1288
## # … with 1,924,655 more rows
You can also use a series of helpers with select(). For example, if you place a minus sign before a column name, select() will return every column but that column. Can you predict how the minus sign will work here?
select(babynames, -c(n, prop))
## # A tibble: 1,924,665 × 3
## year sex name
## <dbl> <chr> <chr>
## 1 1880 F Mary
## 2 1880 F Anna
## 3 1880 F Emma
## 4 1880 F Elizabeth
## 5 1880 F Minnie
## 6 1880 F Margaret
## 7 1880 F Ida
## 8 1880 F Alice
## 9 1880 F Bertha
## 10 1880 F Sarah
## # … with 1,924,655 more rows
The table below summarizes the other select() helpers that are available in dplyr. Study it, and then click “Continue” to test your understanding.
| Helper Function | Use | Example |
|---|---|---|
| - | Columns except | select(babynames, -prop) |
| : | Columns between (inclusive) | select(babynames, year:n) |
| contains() | Columns that contain a string | select(babynames, contains("n")) |
| ends_with() | Columns that end with a string | select(babynames, ends_with("n")) |
| matches() | Columns that match a regex | select(babynames, matches("n")) |
| num_range() | Columns with a numerical suffix in the range | Not applicable with babynames |
| one_of() | Columns whose names appear in the given set | select(babynames, one_of(c("sex", "gender"))) |
| starts_with() | Columns that start with a string | select(babynames, starts_with("n")) |
Which of these is not a way to select the name and n columns together?
filter() extracts rows from a data frame and returns them as a new data frame. As with select(), the first argument of filter() should be a data frame to extract rows from. The arguments that follow should be logical tests; filter() will return every row for which the tests return TRUE.
For example, the code chunk below returns every row with the name “Sea” in babynames.
filter(babynames, name == "Sea")
## # A tibble: 4 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1982 F Sea 5 0.00000276
## 2 1985 M Sea 6 0.00000312
## 3 1986 M Sea 5 0.0000026
## 4 1998 F Sea 5 0.00000258
To get the most from filter, you will need to know how to use R’s logical test operators, which are summarised below.
| Logical operator | tests | Example |
|---|---|---|
| > | Is x greater than y? | x > y |
| >= | Is x greater than or equal to y? | x >= y |
| < | Is x less than y? | x < y |
| <= | Is x less than or equal to y? | x <= y |
| == | Is x equal to y? | x == y |
| != | Is x not equal to y? | x != y |
| is.na() | Is x an NA? | is.na(x) |
| !is.na() | Is x not an NA? | !is.na(x) |
See if you can use the logical operators to manipulate our code below to show:
filter(babynames, name == "Sea")
## # A tibble: 4 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1982 F Sea 5 0.00000276
## 2 1985 M Sea 6 0.00000312
## 3 1986 M Sea 5 0.0000026
## 4 1998 F Sea 5 0.00000258
When you use logical tests, be sure to look out for two common mistakes. One appears in each code chunk below. Can you find them? When you spot a mistake, fix it and then run the chunk to confirm that it works.
filter(babynames, name = "Sea")
filter(babynames, name == "Sea")
## # A tibble: 4 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1982 F Sea 5 0.00000276
## 2 1985 M Sea 6 0.00000312
## 3 1986 M Sea 5 0.0000026
## 4 1998 F Sea 5 0.00000258
"Good Job! Remember to use == instead of = when testing for equality."
filter(babynames, name == Sea)
filter(babynames, name == "Sea")
## # A tibble: 4 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1982 F Sea 5 0.00000276
## 2 1985 M Sea 6 0.00000312
## 3 1986 M Sea 5 0.0000026
## 4 1998 F Sea 5 0.00000258
"Good Job! As written this code would check that name is equal to the contents of the object named Sea, which does not exist."
When you use logical tests, be sure to look out for these two common mistakes:
If you provide more than one test to filter(), filter() will combine the tests with an and statement (&): it will only return the rows that satisfy all of the tests.
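In other words, these two calls are equivalent:

```r
library(dplyr)
library(babynames)
a <- filter(babynames, name == "Sea", sex == "F")
b <- filter(babynames, name == "Sea" & sex == "F")
identical(a, b)   # TRUE: the comma behaves like &
```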
To combine multiple tests in a different way, use R’s Boolean operators. For example, the code below will return all of the children named Sea or Anemone.
filter(babynames, name == "Sea" | name == "Anemone")
## # A tibble: 5 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1982 F Sea 5 0.00000276
## 2 1985 M Sea 6 0.00000312
## 3 1986 M Sea 5 0.0000026
## 4 1998 F Sea 5 0.00000258
## 5 2012 F Anemone 6 0.0000031
You can find a complete list of base R's Boolean operators in the table below.
| Boolean operator | represents | Example |
|---|---|---|
| & | Are both A and B true? | A & B |
| \| | Are one or both of A and B true? | A \| B |
| ! | Is A not true? | !A |
| xor() | Is one and only one of A and B true? | xor(A, B) |
| %in% | Is x in the set of a, b, and c? | x %in% c(a, b, c) |
| any() | Are any of A, B, or C true? | any(A, B, C) |
| all() | Are all of A, B, or C true? | all(A, B, C) |
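To get a feel for these operators before applying them to babynames, here is a standalone sketch (plain R, no data needed); the comment on each line notes what the test returns:

```r
x <- 5

x > 3 & x < 10       # TRUE: both tests pass
x > 10 | x == 5      # TRUE: at least one test passes
!(x > 10)            # TRUE: negation of a false test
xor(x > 3, x > 4)    # FALSE: both tests are true, not exactly one
x %in% c(1, 3, 5)    # TRUE: x appears in the set
any(x > 3, x > 100)  # TRUE: at least one test is true
all(x > 3, x > 100)  # FALSE: not every test is true
```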
Use Boolean operators to alter the code chunk below to return only the rows that contain:
filter(babynames, name == "Sea" | name == "Anemone")
## # A tibble: 5 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1982 F Sea 5 0.00000276
## 2 1985 M Sea 6 0.00000312
## 3 1986 M Sea 5 0.0000026
## 4 1998 F Sea 5 0.00000258
## 5 2012 F Anemone 6 0.0000031
# Girls named Sea
filter(babynames, sex == "F", name == "Sea")
## # A tibble: 2 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1982 F Sea 5 0.00000276
## 2 1998 F Sea 5 0.00000258
# Names that were used by exactly 5 or 6 children (in any year)
filter(babynames, n %in% c(5,6))
## # A tibble: 460,006 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Abby 6 0.0000615
## 2 1880 F Aileen 6 0.0000615
## 3 1880 F Alba 6 0.0000615
## 4 1880 F Alda 6 0.0000615
## 5 1880 F Alla 6 0.0000615
## 6 1880 F Alverta 6 0.0000615
## 7 1880 F Ara 6 0.0000615
## 8 1880 F Ardelia 6 0.0000615
## 9 1880 F Ardella 6 0.0000615
## 10 1880 F Arrie 6 0.0000615
## # … with 459,996 more rows
# Names that are one of Acura, Lexus, or Yugo
filter(babynames, name %in% c("Acura", "Lexus", "Yugo"))
## # A tibble: 57 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1990 F Lexus 36 0.0000175
## 2 1990 M Lexus 12 0.00000558
## 3 1991 F Lexus 102 0.0000502
## 4 1991 M Lexus 16 0.00000755
## 5 1992 F Lexus 193 0.0000963
## 6 1992 M Lexus 25 0.0000119
## 7 1993 F Lexus 285 0.000145
## 8 1993 M Lexus 30 0.0000145
## 9 1994 F Lexus 381 0.000195
## 10 1994 F Acura 6 0.00000308
## # … with 47 more rows
Logical tests also invite two common mistakes that you should look out for. Each is displayed in a code chunk below, one produces an error and the other is needlessly verbose. Diagnose the chunks and then fix the code.
filter(babynames, 10 < n < 20)
filter(babynames, 10 < n, n < 20)
## # A tibble: 365,458 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Antoinette 19 0.000195
## 2 1880 F Clementine 19 0.000195
## 3 1880 F Edythe 19 0.000195
## 4 1880 F Harriette 19 0.000195
## 5 1880 F Libbie 19 0.000195
## 6 1880 F Lilian 19 0.000195
## 7 1880 F Lue 19 0.000195
## 8 1880 F Lutie 19 0.000195
## 9 1880 F Magdalena 19 0.000195
## 10 1880 F Meda 19 0.000195
## # … with 365,448 more rows
"Good job! You cannot combine two logical tests in R without using a Boolean operator (or at least a comma between filter arguments)."
filter(babynames, n == 5 | n == 6 | n == 7 | n == 8 | n == 9)
filter(babynames, n %in% 5:9)
## # A tibble: 811,195 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Adela 9 0.0000922
## 2 1880 F Althea 9 0.0000922
## 3 1880 F Amalia 9 0.0000922
## 4 1880 F Amber 9 0.0000922
## 5 1880 F Angelina 9 0.0000922
## 6 1880 F Annabelle 9 0.0000922
## 7 1880 F Anner 9 0.0000922
## 8 1880 F Arie 9 0.0000922
## 9 1880 F Clarice 9 0.0000922
## 10 1880 F Corda 9 0.0000922
## # … with 811,185 more rows
"Good job! Although the first code works, you should make your code more concise by collapsing multiple or statements into an %in% statement when possible."
When you combine multiple logical tests, be sure to look out for these two common mistakes:
arrange() returns all of the rows of a data frame reordered by the values of a column. As with select(), the first argument of arrange() should be a data frame and the remaining arguments should be the names of columns. If you give arrange() a single column name, it will return the rows of the data frame reordered so that the row with the lowest value in that column appears first, the row with the second lowest value appears second, and so on. If the column contains character strings, arrange() will place them in alphabetical order.
Use the code chunk below to arrange babynames by n. Can you tell what the smallest value of n is?
arrange(babynames, n)
## # A tibble: 1,924,665 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Adelle 5 0.0000512
## 2 1880 F Adina 5 0.0000512
## 3 1880 F Adrienne 5 0.0000512
## 4 1880 F Albertine 5 0.0000512
## 5 1880 F Alys 5 0.0000512
## 6 1880 F Ana 5 0.0000512
## 7 1880 F Araminta 5 0.0000512
## 8 1880 F Arthur 5 0.0000512
## 9 1880 F Birtha 5 0.0000512
## 10 1880 F Bulah 5 0.0000512
## # … with 1,924,655 more rows
"Good job! The compiler of `babynames` used 5 as a cutoff; a name only made it into babynames for a given year and gender if it was used for five or more children."
If you supply additional column names, arrange() will use them as tie breakers to order rows that have identical values in the earlier columns. Add to the code below, to make prop a tie breaker. The result should first order rows by value of n and then reorder rows within each value of n by values of prop.
arrange(babynames, n)
arrange(babynames, n, prop)
## # A tibble: 1,924,665 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 2007 M Aaban 5 0.00000226
## 2 2007 M Aareon 5 0.00000226
## 3 2007 M Aaris 5 0.00000226
## 4 2007 M Abd 5 0.00000226
## 5 2007 M Abdulazeez 5 0.00000226
## 6 2007 M Abdulhadi 5 0.00000226
## 7 2007 M Abdulhamid 5 0.00000226
## 8 2007 M Abdulkadir 5 0.00000226
## 9 2007 M Abdulraheem 5 0.00000226
## 10 2007 M Abdulrahim 5 0.00000226
## # … with 1,924,655 more rows
If you would rather arrange rows in the opposite order, i.e. from large values to small values, surround the column name with desc(). arrange() will then reorder the rows from the largest values to the smallest.
Add a desc() to the code below to display the most popular name for 2017 (the largest year in the dataset) instead of 1880 (the smallest year in the dataset).
arrange(babynames, year, desc(prop))
arrange(babynames, desc(year), desc(n))
## # A tibble: 1,924,665 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 2017 F Emma 19738 0.0105
## 2 2017 M Liam 18728 0.00954
## 3 2017 F Olivia 18632 0.00994
## 4 2017 M Noah 18326 0.00933
## 5 2017 F Ava 15902 0.00848
## 6 2017 F Isabella 15100 0.00805
## 7 2017 M William 14904 0.00759
## 8 2017 F Sophia 14831 0.00791
## 9 2017 M James 14232 0.00725
## 10 2017 M Logan 13974 0.00712
## # … with 1,924,655 more rows
Think you have it? Click Continue to test yourself.
Which name was the most popular for a single gender in a single year? In the code chunk below, use arrange() to make the row with the largest value of prop appear at the top of the data set.
arrange(babynames, desc(prop))
## # A tibble: 1,924,665 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 M John 9655 0.0815
## 2 1881 M John 8769 0.0810
## 3 1880 M William 9532 0.0805
## 4 1883 M John 8894 0.0791
## 5 1881 M William 8524 0.0787
## 6 1882 M John 9557 0.0783
## 7 1884 M John 9388 0.0765
## 8 1882 M William 9298 0.0762
## 9 1886 M John 9026 0.0758
## 10 1885 M John 8756 0.0755
## # … with 1,924,655 more rows
arrange(babynames, desc(n))
## # A tibble: 1,924,665 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1947 F Linda 99686 0.0548
## 2 1948 F Linda 96209 0.0552
## 3 1947 M James 94756 0.0510
## 4 1957 M Michael 92695 0.0424
## 5 1947 M Robert 91642 0.0493
## 6 1949 F Linda 91016 0.0518
## 7 1956 M Michael 90620 0.0423
## 8 1958 M Michael 90520 0.0420
## 9 1948 M James 88588 0.0497
## 10 1954 M Michael 88514 0.0428
## # … with 1,924,655 more rows
"The number of children represented by each proportion grew over time as the population grew."
Notice how each dplyr function takes a data frame as input and returns a data frame as output. This makes the functions easy to use in a step by step fashion. For example, you could:
boys_2017 <- filter(babynames, year == 2017, sex == "M")
boys_2017 <- select(boys_2017, name, n)
boys_2017 <- arrange(boys_2017, desc(n))
boys_2017
## # A tibble: 14,160 × 2
## name n
## <chr> <int>
## 1 Liam 18728
## 2 Noah 18326
## 3 William 14904
## 4 James 14232
## 5 Logan 13974
## 6 Benjamin 13733
## 7 Mason 13502
## 8 Elijah 13268
## 9 Oliver 13141
## 10 Jacob 13106
## # … with 14,150 more rows
The result shows us the most popular boys names from 2017, which is the most recent year in the data set. But take a look at the code. Do you notice how we re-create boys_2017 at each step so we will have something to pass to the next step? This is an inefficient way to write R code.
You could avoid creating boys_2017 by nesting your functions inside of each other, but this creates code that is hard to read:
arrange(select(filter(babynames, year == 2017, sex == "M"), name, n), desc(n))
## # A tibble: 14,160 × 2
## name n
## <chr> <int>
## 1 Liam 18728
## 2 Noah 18326
## 3 William 14904
## 4 James 14232
## 5 Logan 13974
## 6 Benjamin 13733
## 7 Mason 13502
## 8 Elijah 13268
## 9 Oliver 13141
## 10 Jacob 13106
## # … with 14,150 more rows
The dplyr package provides a third way to write sequences of functions: the pipe.
The pipe operator %>% performs an extremely simple task: it passes the result on its left into the first argument of the function on its right. Or put another way, x %>% f(y) is the same as f(x, y). This piece of code punctuation makes it easy to write and read series of functions that are applied in a step by step way. For example, we can use the pipe to rewrite our code above:
babynames %>%
filter(year == 2017, sex == "M") %>%
select(name, n) %>%
arrange(desc(n))
## # A tibble: 14,160 × 2
## name n
## <chr> <int>
## 1 Liam 18728
## 2 Noah 18326
## 3 William 14904
## 4 James 14232
## 5 Logan 13974
## 6 Benjamin 13733
## 7 Mason 13502
## 8 Elijah 13268
## 9 Oliver 13141
## 10 Jacob 13106
## # … with 14,150 more rows
As you read the code, pronounce %>% as “then”. You’ll notice that dplyr makes it easy to read pipes. Each function name is a verb, so our code resembles the statement, “Take babynames, then filter it by year and sex, then select the name and n columns, then arrange the results by descending values of n.”
dplyr also makes it easy to write pipes. Each dplyr function returns a data frame that can be piped into another dplyr function, which will accept the data frame as its first argument. In fact, dplyr functions are written with pipes in mind: each function does one simple task. dplyr expects you to use pipes to combine these simple tasks to produce sophisticated results.
I’ll use pipes for the remainder of the tutorial, and I will expect you to as well. Let’s practice a little by writing a new pipe in the chunk below. The pipe should:
Try to write your pipe without copying and pasting the code from above.
babynames %>%
filter(year == 2017, sex == "F") %>%
select(name, n) %>%
arrange(desc(n))
## # A tibble: 18,309 × 2
## name n
## <chr> <int>
## 1 Emma 19738
## 2 Olivia 18632
## 3 Ava 15902
## 4 Isabella 15100
## 5 Sophia 14831
## 6 Mia 13437
## 7 Charlotte 12893
## 8 Amelia 11800
## 9 Evelyn 10675
## 10 Abigail 10551
## # … with 18,299 more rows
You’ve now mastered a set of skills that will let you easily plot the popularity of your name over time. In the code chunk below, use a combination of dplyr and ggplot2 functions with %>% to:
Note that the first argument of ggplot() takes a data frame, which means you can add ggplot() directly to the end of a pipe. However, you will need to switch from %>% to + to finish adding layers to your plot.
babynames %>%
filter(name == "John", sex == "M") %>%
select(year, prop) %>%
ggplot() +
geom_line(aes(x = year, y = prop))
#### Recap
Together, select(), filter(), and arrange() let you quickly find information displayed within your data.
The next tutorial will show you how to derive information that is implied by your data, but not displayed within your data set.
In that tutorial, you will continue to use the %>% operator, which is an essential part of programming with the dplyr package.
Pipes help make R expressive, like a spoken language. Spoken languages consist of simple words that you combine into sentences to create sophisticated thoughts.
In the tidyverse, functions are like words: each does one simple task well. You can combine these tasks into pipes with %>% to perform complex, customized procedures.
Data sets contain more information than they display, and this tutorial will show you how to access that information. You’ll learn to derive new variables and to compute groupwise summary statistics.
In this case study, you will identify the most popular American names from 1880 to 2017. While doing this, you will master three more dplyr functions:
These are some of the most useful R functions for data science, and this tutorial provides everything you need to learn them.
This tutorial uses the core tidyverse packages, including ggplot2, tibble, and dplyr, as well as the babynames package. All of these packages have been pre-installed and pre-loaded for your convenience.
Click the Next Topic button to begin.
Let’s use babynames to answer a different question: what are the most popular names of all time?
This question seems simple enough, but to answer it we need to be more precise: how do you define “the most popular” names? Try to think of several definitions and then click Continue. After the Continue button, I will suggest two definitions of my own.
I suggest that we focus on two definitions of popular, one that uses sums and one that uses ranks:
This raises a question:
Do we have enough information in babynames to compare the popularity of names?
Every data frame that you meet implies more information than it displays. For example, babynames does not display the total number of children who had your name, but babynames certainly implies what that number is. To discover the number, you only need to do a calculation:
babynames %>%
filter(name == "Garrett", sex == "M") %>%
summarise(total = sum(n))
## # A tibble: 1 × 1
## total
## <int>
## 1 129759
dplyr provides three functions that can help you reveal the information implied by your data:
Like select(), filter() and arrange(), these functions all take a data frame as their first argument and return a new data frame as their output, which makes them easy to use in pipes.
Let’s master each function and use them to analyze popularity as we go.
summarise() takes a data frame and uses it to calculate a new data frame of summary statistics.
To use summarise(), pass it a data frame and then one or more named arguments. Each named argument should be set to an R expression that generates a single value. Summarise will turn each named argument into a column in the new data frame. The name of each argument will become the column name, and the value returned by the argument will become the column contents.
I used summarise() above to calculate the total number of boys named “Garrett”, but let’s expand that code to also calculate
babynames %>%
filter(name == "Garrett", sex == "M") %>%
summarise(total = sum(n), max = max(n), mean = mean(n))
## # A tibble: 1 × 3
## total max mean
## <int> <int> <dbl>
## 1 129759 5840 940.
Don’t let the code above fool you. The first argument of summarise()
is always a data frame, but when you use summarise() in a pipe, the
first argument is provided by the pipe operator, %>%. Here the first
argument will be the data frame that is returned by
babynames %>% filter(name == "Garrett", sex == "M").
Use the code chunk below to compute three statistics:
If you cannot think of an R function that would compute each statistic, click the Hint/Solution button.
babynames %>%
filter(name == "John", sex == "M") %>%
summarise(total = sum(n), max = max(n), mean = mean(n))
## # A tibble: 1 × 3
## total max mean
## <int> <int> <dbl>
## 1 5115466 88318 37069.
So far our summarise() examples have relied on sum(), max(), and mean(). But you can use any function in summarise() so long as it meets one criterion: the function must take a vector of values as input and return a single value as output. Functions that do this are known as summary functions, and they are common in the field of descriptive statistics. Some of the most useful summary functions include:
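For example, here is a sketch that swaps in a few other common summary functions (median(), sd(), n(), and first()); I have not shown the output, since the exact values depend on your version of babynames:

```r
babynames %>%
  filter(name == "Garrett", sex == "M") %>%
  summarise(
    median = median(n),       # middle value of n across years
    sd = sd(n),               # spread of n across years
    years = n(),              # number of rows (years) in the group
    first_year = first(year)  # first value of year (the earliest, since babynames is sorted by year)
  )
```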
Let’s apply some of these summary functions. Click Continue to test your understanding.
“Khaleesi” is a very modern name that appears to be based on the Game of Thrones TV series, which premiered on April 17, 2011. In the chunk below, filter babynames to just the rows where name == "Khaleesi". Then use summarise() and a summary function to return the first value of year in the data set.
babynames %>%
filter(name == "Khaleesi") %>%
summarize(year = first(year))
## # A tibble: 1 × 1
## year
## <dbl>
## 1 2011
In the chunk below, use summarise() and a summary function to return a data frame with two columns:
Will these numbers be different? Why or why not?
babynames %>%
summarize(n(), distinct = n_distinct(name))
## # A tibble: 1 × 2
## `n()` distinct
## <int> <int>
## 1 1924665 97310
"Good job! The two numbers are different because most names appear in the data set more than once. They appear once for each year in which they were used."
How can we apply summarise() to find the most popular names in babynames? You’ve seen how to calculate the total number of children that have your name, which provides one of our measures of popularity, i.e. the total number of children that have a name:
babynames %>%
filter(name == "Garrett", sex == "M") %>%
summarise(total = sum(n))
However, we had to isolate your name from the rest of your data to calculate this number. You could imagine writing a program that goes through each name one at a time and:
Eventually, the program could combine all of the results back into a single data set. However, you don’t need to write such a program; this is the job of dplyr’s group_by() function.
group_by() takes a data frame and then the names of one or more columns in the data frame. It returns a copy of the data frame that has been “grouped” into sets of rows that share identical combinations of values in the specified columns.
For example, the result below is grouped into rows that have the same combination of year and sex values: boys in 1880 are treated as one group, girls in 1880 as another group and so on.
babynames %>%
group_by(year, sex)
## # A tibble: 1,924,665 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
## 7 1880 F Ida 1472 0.0151
## 8 1880 F Alice 1414 0.0145
## 9 1880 F Bertha 1320 0.0135
## 10 1880 F Sarah 1288 0.0132
## # … with 1,924,655 more rows
By itself, group_by() doesn’t do much. It assigns grouping criteria, which are stored as metadata alongside the original data set. If your dataset is a tibble, as above, R will tell you that the data is grouped at the top of the tibble display. In all other respects, the data looks the same.
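You can inspect that metadata directly. For example, dplyr’s group_vars() returns the names of the grouping columns (a quick check, not something you will need often):

```r
babynames %>%
  group_by(year, sex) %>%
  group_vars()
## [1] "year" "sex"
```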
However, when you apply a dplyr function like summarise() to grouped data, dplyr will execute the function in a groupwise manner. Instead of computing a single summary for the entire data set, dplyr will compute individual summaries for each group and return them as a single data frame. The data frame will contain the summary columns as well as the columns in the grouping criteria, which makes the result decipherable:
babynames %>%
group_by(year, sex) %>%
summarise(total = sum(n))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 276 × 3
## year sex total
## <dbl> <chr> <int>
## 1 1880 F 90993
## 2 1880 M 110491
## 3 1881 F 91953
## 4 1881 M 100743
## 5 1882 F 107847
## 6 1882 M 113686
## 7 1883 F 112319
## 8 1883 M 104627
## 9 1884 F 129020
## 10 1884 M 114442
## # … with 266 more rows
To understand exactly what group_by() is doing, remove the line group_by(year, sex) %>% from the code above and rerun it. How do the results change?
babynames %>%
summarise(total = sum(n))
## # A tibble: 1 × 1
## total
## <int>
## 1 348120517
If you apply summarise() to grouped data, summarise() will return data that is grouped in a similar, but not identical fashion. summarise() will remove the last variable in the grouping criteria, which creates a data frame that is grouped at a higher level. For example, this summarise() statement receives a data frame that is grouped by year and sex, but it returns a data frame that is grouped only by year.
babynames %>%
group_by(year, sex) %>%
summarise(total = sum(n))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 276 × 3
## year sex total
## <dbl> <chr> <int>
## 1 1880 F 90993
## 2 1880 M 110491
## 3 1881 F 91953
## 4 1881 M 100743
## 5 1882 F 107847
## 6 1882 M 113686
## 7 1883 F 112319
## 8 1883 M 104627
## 9 1884 F 129020
## 10 1884 M 114442
## # … with 266 more rows
If only one grouping variable is left in the grouping criteria, summarise() will return an ungrouped data set. This feature lets you progressively “unwrap” a grouped data set:
If we add another summarise() to our pipe, it will collapse the data one level further, summing the totals within each year:
babynames %>%
group_by(year, sex) %>%
summarise(total = sum(n)) %>%
summarise(total = sum(total))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 138 × 2
## year total
## <dbl> <int>
## 1 1880 201484
## 2 1881 192696
## 3 1882 221533
## 4 1883 216946
## 5 1884 243462
## 6 1885 240854
## 7 1886 255317
## 8 1887 247394
## 9 1888 299473
## 10 1889 288946
## # … with 128 more rows
If you wish to manually remove the grouping criteria from a data set, you can do so with ungroup().
babynames %>%
group_by(year, sex) %>%
ungroup()
## # A tibble: 1,924,665 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
## 7 1880 F Ida 1472 0.0151
## 8 1880 F Alice 1414 0.0145
## 9 1880 F Bertha 1320 0.0135
## 10 1880 F Sarah 1288 0.0132
## # … with 1,924,655 more rows
And, you can override the current grouping information with a new call to group_by().
babynames %>%
group_by(year, sex) %>%
group_by(name)
## # A tibble: 1,924,665 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
## 7 1880 F Ida 1472 0.0151
## 8 1880 F Alice 1414 0.0145
## 9 1880 F Bertha 1320 0.0135
## 10 1880 F Sarah 1288 0.0132
## # … with 1,924,655 more rows
That’s it. Between group_by(), summarise(), and ungroup(), you have a toolkit for taking groupwise summaries of your data at various levels of grouping.
You now know enough to calculate the most popular names by total children (it may take some strategizing, but you can do it!).
In the code chunk below, use group_by(), summarise(), and arrange() to display the ten most popular names. Compute popularity as the total number of children of a single gender given a name. In other words, the total number of boys named “Kelly” should be computed separately from the total number of girls named “Kelly”.
babynames %>%
group_by(name, sex) %>%
summarize(total = sum(n)) %>%
arrange(desc(total))
## `summarise()` has grouped output by 'name'. You can override using the
## `.groups` argument.
## # A tibble: 107,973 × 3
## name sex total
## <chr> <chr> <int>
## 1 James M 5150472
## 2 John M 5115466
## 3 Robert M 4814815
## 4 Michael M 4350824
## 5 Mary F 4123200
## 6 William M 4102604
## 7 David M 3611329
## 8 Joseph M 2603445
## 9 Richard M 2563082
## 10 Charles M 2386048
## # … with 107,963 more rows
Let’s examine how the popularity of popular names has changed over time. To help us, I’ve made top_10, which is a version of babynames that is trimmed down to just the ten most popular names from above.
## # A tibble: 1,380 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 M John 9655 0.0815
## 3 1880 M William 9532 0.0805
## 4 1880 M James 5927 0.0501
## 5 1880 M Charles 5348 0.0452
## 6 1880 M Joseph 2632 0.0222
## 7 1880 M Robert 2415 0.0204
## 8 1880 M David 869 0.00734
## 9 1880 M Richard 728 0.00615
## 10 1880 M Michael 354 0.00299
## # … with 1,370 more rows
Use the code block below to plot a line graph of prop vs year for each name in top_10. Be sure to color the lines by name to make the graph interpretable.
# See https://github.com/rstudio-education/primers/blob/master/transform-data/03-deriving/03-deriving.Rmd
tops <- babynames %>%
group_by(name, sex) %>%
summarise(total = sum(n)) %>%
ungroup() %>%
top_n(10, total)
## `summarise()` has grouped output by 'name'. You can override using the
## `.groups` argument.
tops
## # A tibble: 10 × 3
## name sex total
## <chr> <chr> <int>
## 1 Charles M 2386048
## 2 David M 3611329
## 3 James M 5150472
## 4 John M 5115466
## 5 Joseph M 2603445
## 6 Mary F 4123200
## 7 Michael M 4350824
## 8 Richard M 2563082
## 9 Robert M 4814815
## 10 William M 4102604
top_10 <- babynames::babynames %>%
semi_join(tops, by = c("name", "sex"))
top_10
## # A tibble: 1,380 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 M John 9655 0.0815
## 3 1880 M William 9532 0.0805
## 4 1880 M James 5927 0.0501
## 5 1880 M Charles 5348 0.0452
## 6 1880 M Joseph 2632 0.0222
## 7 1880 M Robert 2415 0.0204
## 8 1880 M David 869 0.00734
## 9 1880 M Richard 728 0.00615
## 10 1880 M Michael 354 0.00299
## # … with 1,370 more rows
# The following is my own solution, written before I found the code above
top10 <- babynames %>%
group_by(name, sex) %>%
summarize(total = sum(n)) %>%
arrange(desc(total)) %>%
head(10)
## `summarise()` has grouped output by 'name'. You can override using the
## `.groups` argument.
# top10$name
my_top_10 <- babynames %>%
  filter((sex == "F" & name %in% filter(top10, sex == "F")$name) |
           (sex == "M" & name %in% filter(top10, sex == "M")$name))
identical(top_10, my_top_10)
## [1] TRUE
top_10 %>%
ggplot() +
geom_line(aes(x = year, y = prop, color = name))

Now use top_10 to plot n vs year for each of the names. How are the plots different? Why might that be? How does this affect our decision to use total children as a measure of popularity?
top_10 %>%
ggplot() +
geom_line(aes(x = year, y = n, color = name))

"Good job! This graph shows different trends than the one above, now let's consider why."
Why might there be a difference between the proportion of children who receive a name over time, and the number of children who receive the name?
An obvious culprit could be the total number of children born per year. If more children are born each year, the number of children who receive a name could grow even if the proportion of children given that name declines.
Test this theory in the chunk below. Use babynames and groupwise summaries to compute the total number of children born each year and then to plot that number vs. year in a line graph.
babynames %>%
group_by(year) %>%
summarize(total = sum(n)) %>%
ggplot() +
geom_line(aes(x = year, y = total))
#### Popularity based on rank
The graph above suggests that our first definition of popularity is confounded with population growth: the most popular names in 2017 likely represent far more children than the most popular names in 1880. The total number of children given a name may still be the best definition of popularity to use, but it will overweight names that have been popular in recent years.
There is also evidence that our definition is confounded with a gender effect: only one of the top ten names was a girl’s name.
If you are concerned about these things, you might prefer to use our second definition of popularity, which would give equal representation to each year and gender:
To use this definition, we could:
To do this, we will need to learn one last dplyr function.
mutate() uses a data frame to compute new variables. It then returns a copy of the data frame that includes the new variables. For example, we can use mutate() to compute a percent variable for babynames. Here percent is just the prop multiplied by 100 and rounded to two decimal places.
babynames %>%
mutate(percent = round(prop * 100, 2))
## # A tibble: 1,924,665 × 6
## year sex name n prop percent
## <dbl> <chr> <chr> <int> <dbl> <dbl>
## 1 1880 F Mary 7065 0.0724 7.24
## 2 1880 F Anna 2604 0.0267 2.67
## 3 1880 F Emma 2003 0.0205 2.05
## 4 1880 F Elizabeth 1939 0.0199 1.99
## 5 1880 F Minnie 1746 0.0179 1.79
## 6 1880 F Margaret 1578 0.0162 1.62
## 7 1880 F Ida 1472 0.0151 1.51
## 8 1880 F Alice 1414 0.0145 1.45
## 9 1880 F Bertha 1320 0.0135 1.35
## 10 1880 F Sarah 1288 0.0132 1.32
## # … with 1,924,655 more rows
The syntax of mutate() is similar to that of summarise(). mutate() takes a data frame first, and then one or more named arguments that are set to R expressions. mutate() turns each named argument into a column. The name of the argument becomes the column name, and the result of the R expression becomes the column contents.
Use mutate() in the chunk below to create a births column, the result of dividing n by prop. You can think of births as a sanity check; it uses each row to double check the number of boys or girls that were born each year. If all is well, the numbers will agree across rows (allowing for rounding errors).
babynames %>%
mutate(births = n / prop)
## # A tibble: 1,924,665 × 6
## year sex name n prop births
## <dbl> <chr> <chr> <int> <dbl> <dbl>
## 1 1880 F Mary 7065 0.0724 97605.
## 2 1880 F Anna 2604 0.0267 97605.
## 3 1880 F Emma 2003 0.0205 97605.
## 4 1880 F Elizabeth 1939 0.0199 97605.
## 5 1880 F Minnie 1746 0.0179 97605.
## 6 1880 F Margaret 1578 0.0162 97605.
## 7 1880 F Ida 1472 0.0151 97605.
## 8 1880 F Alice 1414 0.0145 97605.
## 9 1880 F Bertha 1320 0.0135 97605.
## 10 1880 F Sarah 1288 0.0132 97605.
## # … with 1,924,655 more rows
Like summarise(), mutate() works in combination with a specific type of function. summarise() expects summary functions, which take vectors of input and return single values. mutate() expects vectorized functions, which take vectors of input and return vectors of values.
In other words, summary functions like min() and max() won’t work well with mutate(). You can see why if you take a moment to think about what mutate() does: mutate() adds a new column to the original data set. In R, every column in a dataset must be the same length, so mutate() must supply as many values for the new column as there are in the existing columns.
If you give mutate() an expression that returns a single value, it will follow R’s recycling rules and repeat that value as many times as needed to fill the column. This can make sense in some cases, but the reverse is never true: you cannot give summarise() a vectorized function; summarise() needs each of its expressions to return a single value.
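To see the recycling rule in action, compare a vectorized expression with a single-value expression inside mutate(). This sketch uses a small toy tibble rather than babynames so the whole result fits on screen:

```r
df <- tibble(x = c(1, 5, 10))

df %>%
  mutate(
    doubled = x * 2,  # vectorized: one result per row
    biggest = max(x)  # single value: recycled to fill the column
  )
## # A tibble: 3 × 3
##       x doubled biggest
##   <dbl>   <dbl>   <dbl>
## 1     1       2      10
## 2     5      10      10
## 3    10      20      10
```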
What are some of R’s vectorized functions? Click Continue to find out.
Some of the most useful vectorized functions in R to use with mutate() include:
For ranking, I recommend that you use min_rank(), which gives the smallest values the top ranks. To rank in descending order, use the familiar desc() function, e.g.
min_rank(c(50, 100, 1000))
## [1] 1 2 3
min_rank(desc(c(50, 100, 1000)))
## [1] 3 2 1
Let’s practice by ranking the entire dataset based on prop. In the chunk below, use mutate() and min_rank() to rank each row based on its prop value, with the highest values receiving the top ranks.
babynames %>%
mutate(rank = min_rank(desc(prop)))
## # A tibble: 1,924,665 × 6
## year sex name n prop rank
## <dbl> <chr> <chr> <int> <dbl> <int>
## 1 1880 F Mary 7065 0.0724 14
## 2 1880 F Anna 2604 0.0267 709
## 3 1880 F Emma 2003 0.0205 1131
## 4 1880 F Elizabeth 1939 0.0199 1192
## 5 1880 F Minnie 1746 0.0179 1427
## 6 1880 F Margaret 1578 0.0162 1683
## 7 1880 F Ida 1472 0.0151 1897
## 8 1880 F Alice 1414 0.0145 2039
## 9 1880 F Bertha 1320 0.0135 2279
## 10 1880 F Sarah 1288 0.0132 2387
## # … with 1,924,655 more rows
In the previous exercise, we assigned rankings across the entire data set. For example, with the exception of ties, there was only one 1 in the entire data set, only one 2, and so on. To calculate a popularity score across years, you will need to do something different: you will need to assign rankings within groups of year and sex. Now there will be one 1 in each group of year and sex.
To rank within groups, combine mutate() with group_by(). Like dplyr’s other functions, mutate() will treat grouped data in a group-wise fashion.
Add group_by() to our code from above to calculate rankings within each year and sex combination. Do you notice how the numbers change?
babynames %>%
group_by(year, sex) %>%
mutate(rank = min_rank(desc(prop)))
## # A tibble: 1,924,665 × 6
## year sex name n prop rank
## <dbl> <chr> <chr> <int> <dbl> <int>
## 1 1880 F Mary 7065 0.0724 1
## 2 1880 F Anna 2604 0.0267 2
## 3 1880 F Emma 2003 0.0205 3
## 4 1880 F Elizabeth 1939 0.0199 4
## 5 1880 F Minnie 1746 0.0179 5
## 6 1880 F Margaret 1578 0.0162 6
## 7 1880 F Ida 1472 0.0151 7
## 8 1880 F Alice 1414 0.0145 8
## 9 1880 F Bertha 1320 0.0135 9
## 10 1880 F Sarah 1288 0.0132 10
## # … with 1,924,655 more rows
group_by() provides the missing piece for calculating our second measure of popularity. In the code chunk below, we rank each name within its year and sex combination, then compute each name's median ranking across years and arrange the results so that the most popular names appear first.
babynames %>%
group_by(year, sex) %>%
mutate(rank = min_rank(desc(prop))) %>%
ungroup() %>%
group_by(name, sex) %>%
summarize(median_ranking = median(rank)) %>%
arrange(median_ranking)
## `summarise()` has grouped output by 'name'. You can override using the
## `.groups` argument.
## # A tibble: 107,973 × 3
## name sex median_ranking
## <chr> <chr> <dbl>
## 1 Mary F 1
## 2 James M 3
## 3 John M 3
## 4 William M 4
## 5 Robert M 6
## 6 Michael M 7.5
## 7 Charles M 9
## 8 Elizabeth F 10
## 9 Joseph M 10
## 10 Thomas M 11
## # … with 107,963 more rows
In this primer, you learned three functions for isolating data within a table: select(), filter(), and arrange().
You also learned three functions for deriving new data from a table: summarise(), group_by(), and mutate().
Together these six functions create a grammar of data manipulation, a system of verbs that you can use to manipulate data in a sophisticated, step-by-step way. These verbs target the everyday tasks of data analysis. No matter which types of data you work with, you will discover that:
The six dplyr functions help you work with these realities by isolating and revealing the information contained in your data. In fact, dplyr provides more than six functions for this grammar: dplyr comes with several functions that are variations on the themes of select(), filter(), summarise(), and mutate(). Each follows the same pipeable syntax that is used throughout dplyr. If you are interested, you can learn more about these peripheral functions in the dplyr cheatsheet.
Apply your knowledge of dplyr to do the following two challenges.
How many distinct boys' names achieved a rank of Number 1 in any year?
top_male <- babynames %>%
group_by(year, sex) %>%
mutate(rank = min_rank(desc(n))) %>%
filter(rank == 1, sex == "M")
unique(top_male$name)
## [1] "John" "Robert" "James" "Michael" "David" "Jacob" "Noah"
## [8] "Liam"
babynames %>%
group_by(year, sex) %>%
mutate(rank = min_rank(desc(n))) %>%
filter(rank == 1, sex == "M") %>%
ungroup() %>%
summarise(distinct = n_distinct(name))
## # A tibble: 1 × 1
## distinct
## <int>
## 1 8
babynames %>%
group_by(year, sex) %>%
mutate(rank = min_rank(desc(n))) %>%
filter(rank == 1, sex == "M") %>%
ungroup() %>%
group_by(name) %>%
summarise(distinct = n_distinct(year)) %>%
arrange(desc(distinct))
## # A tibble: 8 × 2
## name distinct
## <chr> <int>
## 1 John 44
## 2 Michael 44
## 3 Robert 17
## 4 Jacob 14
## 5 James 13
## 6 Noah 4
## 7 David 1
## 8 Liam 1
How many distinct girls' names achieved a rank of Number 1 in any year?
babynames %>%
group_by(year, sex) %>%
mutate(rank = min_rank(desc(n))) %>%
filter(rank == 1, sex == "F") %>%
ungroup() %>%
summarise(distinct = n_distinct(name))
## # A tibble: 1 × 1
## distinct
## <int>
## 1 10
babynames %>%
group_by(year, sex) %>%
mutate(rank = min_rank(desc(n))) %>%
filter(rank == 1, sex == "F") %>%
ungroup() %>%
group_by(name) %>%
summarise(distinct = n_distinct(year)) %>%
arrange(desc(distinct))
## # A tibble: 10 × 2
## name distinct
## <chr> <int>
## 1 Mary 76
## 2 Jennifer 15
## 3 Emily 12
## 4 Jessica 9
## 5 Lisa 8
## 6 Linda 6
## 7 Emma 5
## 8 Sophia 3
## 9 Ashley 2
## 10 Isabella 2
number_ones is a vector of every boys' name to achieve a rank of Number 1.
number_ones
## [1] "John"    "Robert"  "James"   "Michael" "David"   "Jacob"   "Noah"
## [8] "Liam"
Use number_ones with babynames to recreate the plot below, which shows the popularity over time for every name in number_ones.
image
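One way to recreate such a plot, a sketch that assumes number_ones is available in the chunk (the target plot's exact styling may differ):

```r
library(babynames)
library(dplyr)
library(ggplot2)

# defined here for a self-contained example; in the tutorial,
# number_ones is pre-supplied
number_ones <- c("John", "Robert", "James", "Michael",
                 "David", "Jacob", "Noah", "Liam")

# one line per name, tracking prop across years
babynames %>%
  filter(name %in% number_ones, sex == "M") %>%
  ggplot() +
    geom_line(aes(x = year, y = prop, color = name))
```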
Which gender uses more names?
In the chunk below, calculate and then plot the number of distinct names used each year for boys and girls. Place year on the x axis and the number of distinct names on the y axis, and color the lines by sex.
babynames %>%
group_by(year, sex) %>%
summarize(distinct_names = n_distinct(name)) %>%
ggplot() +
geom_line(aes(x = year, y = distinct_names, color = sex))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

What about the code below? Does it produce the same plot?
babynames %>%
group_by(year, sex) %>%
mutate(distinct_names = n_distinct(name)) %>%
ggplot() +
geom_line(aes(x = year, y = distinct_names, color = sex))

Let’s make sure that we’re not confounding our search with the total number of boys and girls born each year. With the chunk below, calculate and then plot the total number of boys and girls born each year. Is the relative number of boys and girls constant?
babynames %>%
group_by(year, sex) %>%
summarize(total = sum(n)) %>%
ggplot() +
geom_line(aes(x = year, y = total, color = sex))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
#### Name Diversity Challenge - children per name
Hmm. Sometimes there are more girls and sometimes more boys. In addition, the entire population has grown over time. Let’s account for this with a new metric: the average number of children per name.
If girls have a smaller number of children per name, that would imply that they use more names overall (and vice versa).
In the chunk below, calculate and plot the average number of children per name by year and sex over time. How do you interpret the results?
babynames %>%
group_by(year, sex) %>%
summarize(average = mean(n)) %>%
ggplot() +
geom_line(aes(x = year, y = average, color = sex))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

Congratulations! You can use dplyr’s grammar of data manipulation to access any data associated with a table—even if that data is not currently displayed by the table.
In other words, you now know how to look at data in R, as well as how to access specific values, calculate summary statistics, and compute new variables. When you combine this with the visualization skills that you learned in Visualization Basics, you have everything that you need to begin exploring data in R.
The next tutorial will teach you the last of three basic skills for working with R:
Learn how to use ggplot2 to make any type of plot with your data. Then learn the best ways to visualize patterns within values and relationships between variables.
If you’re ready to begin, go to the first tutorial. There is no need to install or download anything. Each tutorial has everything you need to write and run R code, right in the tutorial.
Start here to learn how to explore your data with visualizations, using a strategy known as Exploratory Data Analysis (EDA).
This tutorial will show you how to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. In the tutorial you will:
The tutorial is excerpted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. You can purchase the book at shop.oreilly.com.
EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:
Generate questions about your data
Search for answers by visualising, transforming, and/or modeling your data
Use what you learn to refine your questions and/or generate new questions
EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not.
EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA, you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on lines of inquiry that reveal insights worth writing up and communicating to others.
Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey
EDA is, fundamentally, a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will highlight a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data—and develop a set of thought-provoking questions—if you follow up each question with a new question based on what you find.
“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox
There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:
The rest of this tutorial will look at these two questions. To make the discussion easier, let’s define some terms…
You can think of science as a process with two steps: discovery and confirmation. Scientists first observe the world to discover a hypothesis to test. Then, they devise a test to confirm the hypothesis against new data. If a hypothesis survives many tests, scientists begin to trust that it is a reliable explanation of the data.
The separation between discovery and confirmation is especially important for data scientists. It is easy for patterns to appear in data by coincidence. As a result, data scientists first look for patterns, and then try to confirm that the patterns exist in the real world. Sometimes this confirmation requires computing the probability that the pattern is due to random chance, a task that often involves collecting new data.
Is EDA a tool for discovery or confirmation?
Correct!
EDA is a tool for discovery; in fact, EDA is one of the most fruitful tools for discovery in science. We'll focus on discovery throughout this primer, but remember that you should test any pattern that you discover before you rely on it.
When you begin to explore data, is it better to formulate one or two high-quality questions to ask, or many, many questions to explore?
Correct!
Each question you ask creates a new opportunity to discover something surprising. You can lead yourself to high-value questions by iterating on questions that reveal unexpected results.
iris is a famous toy data set that comes with R. The data set describes 150 iris flowers. Each row in iris displays a flower’s sepal and petal dimensions. You can use these measurements to deduce the flower’s species, which is also displayed in iris.
iris
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # … with 140 more rows
Which of these is a variable in the iris dataset?
Correct!
Which of these is a value in the iris dataset?
Correct!
Which of these is an observation in the iris dataset?
Correct!
These measurements were all collected under similar circumstances: on the same flower, presumably at the same time. If a relationship exists between the variables that these values describe, we would expect the relationship to also exist between these values.
Variation is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice—and precisely enough—you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Categorical variables can also vary if you measure across different objects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments).
Every variable has its own pattern of variation, which can reveal useful information. The best way to understand that pattern is to visualise the distribution of the variable’s values. How you visualise the distribution of a variable will depend on whether the variable is categorical or continuous.
A variable is categorical if it can take only one of a small set of values. In R, categorical variables are usually saved as factors or character vectors. You can visualize the distribution of a categorical variable with a bar chart, like the one below.
Don’t worry if you cannot make or interpret a bar chart. We’ll survey several types of charts in this tutorial, as we create a strategy for EDA. You’ll learn how to build each type of chart in the tutorials that follow.
A variable is continuous if it can take any of an infinite set of smooth, ordered values. Here, smooth means that if you order the values on a line, an infinite number of values would exist between any two points on the line. For example, an infinite number of values exists between 0 and 1, e.g. 0.9, 0.99, 0.999, and so on.
Numbers and date-times are two examples of continuous variables. You can visualize the distribution of a continuous variable with a histogram, like the one below:
#### Frequencies
In both bar charts and histograms, tall bars show the common values of a variable, i.e. the values that appear frequently. Shorter bars show less-common values, i.e. values that appear infrequently. Places that do not have bars reveal values that were not seen in your data. To turn this information into useful questions, look for anything unexpected:
Many of the questions above will prompt you to explore a relationship between variables, to see if the values of one variable can explain the values of another variable. We’ll get to that shortly.
The bar chart below visualises the distribution of the class variable in the mpg data set, which comes in the ggplot2 package. The height of the bars reveal how many cars in the data set come from each class.
The distribution of class in mpg
What is the most common type of car in the mpg data set?
What is the least common type of car in the mpg data set?
Correct!
Does the distribution of cars in the mpg dataset seem to reflect the distribution of cars that you see on the road? Would your answer shape how you use this data?
Correct!
For continuous variables, clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:
The histogram below shows the distribution of the eruptions variable in the faithful data set, which comes with R. eruptions shows the lengths (in minutes) of 272 eruptions of the Old Faithful geyser in Yellowstone National Park.
To interpret the histogram, look first at the x axis, which displays the lengths of eruptions recorded in the data. The range of the x axis shows that the shortest eruptions lasted for about one minute and the longest for about five minutes.
To see how many eruptions lasted for a specific length of time, find the length of time on the x axis and then look at the height of the bar above the length of time. For example, according to the histogram, 30 eruptions lasted for about two minutes, but only three lasted for about three minutes (the height of the bar above two is 30, the height of the bar above three is three).
image
Do the eruption lengths cluster into groups? How many?
Eruption lengths appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but few eruptions in between.
If variation describes the behavior within a variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables. How you do that should again depend on whether your variables are categorical or continuous.
#### Two categorical variables
You can plot the relationship between two categorical variables with a heatmap or with geom_count:
image
image
Again, don’t be concerned if you do not know how to make these graphs. For now, let’s focus on the strategy of how to use visualizations in EDA. You’ll learn how to make different types of plots in the tutorials that follow.
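For the curious, here is one way these two plots could be made, a sketch using the diamonds data (an assumption; the data behind the images above isn't shown):

```r
library(dplyr)
library(ggplot2)

# geom_count(): point size encodes how many observations
# share each combination of the two categorical variables
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = clarity))

# heatmap: count the combinations first, then map the
# counts to fill with geom_tile()
diamonds %>%
  count(cut, clarity) %>%
  ggplot() +
    geom_tile(mapping = aes(x = cut, y = clarity, fill = n))
```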
You can plot the relationship between one continuous and one categorical variable with a boxplot:
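For example, a sketch using the mpg data (an assumption; the tutorial's own image isn't shown):

```r
library(ggplot2)

# one categorical variable (class) against one
# continuous variable (hwy, highway fuel efficiency)
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy))
```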
#### Two continuous variables
You can plot the relationship between two continuous variables with a scatterplot:
image
Patterns in your data provide clues about relationships. If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:
Remember that clusters and outliers are also a type of pattern. Two dimensional plots can reveal clusters and outliers that would not be visible in a one dimensional plot. If you spot either, ask yourself what they imply.
The scatterplot below shows the relationship between the length of an eruption of Old Faithful and the wait time before the eruption (i.e. the amount of time that passed between it and the previous eruption).
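A scatterplot like the one described could be drawn with geom_point(); a minimal sketch using the faithful data that comes with R:

```r
library(ggplot2)

# wait time before an eruption vs. length of the eruption
ggplot(data = faithful) +
  geom_point(mapping = aes(x = waiting, y = eruptions))
```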
Does the scatterplot above reveal a pattern that helps to explain the variation in lengths of Old Faithful eruptions?
Correct!
The data seems to suggest that a long build up before an eruption is associated with a long eruption. The plot also shows the two clusters that we saw before: there are long eruptions with a long build up and short eruptions with a short build up.
Patterns provide a useful tool for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. When two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), you can use the value of one variable to control the value of the second.
You’ve learned a lot in this tutorial. Here’s what you should keep with you:
Throughout the tutorial, you also encountered several recommendations for plots that visualize variation and covariation for categorical and continuous variables. Plots are a bit like questions in EDA: you should make many quickly and try anything that strikes your fancy. You can refine your plots later to share with others. A lot of refinement will occur naturally as you iterate during EDA.
The suggestions below can serve as starting point for visualizing data. In the tutorials that follow, you will learn how to make each type of plot, as well as how to use best practices and advanced skills when visualizing data.
image
Learn to make and customize bar charts, a device for visualizing the distribution of categorical variables. Here, you will also learn to use ggplot2 position adjustments and facetting.
This tutorial will show you how to make and enhance bar charts with the ggplot2 package. You will learn how to:
The tutorial is adapted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. You can purchase the book at shop.oreilly.com.
The tutorial uses the ggplot2 and dplyr packages, which have been pre-loaded for your convenience.
To make a bar chart with ggplot2, add geom_bar() to the ggplot2 template. For example, the code below plots a bar chart of the cut variable in the diamonds dataset, which comes with ggplot2.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))

You should not supply a y aesthetic when you use geom_bar(); ggplot2 will count how many times each x value appears in the data, and then display the counts on the y axis. So, for example, the plot above shows that over 20,000 diamonds in the data set had a value of Ideal.
You can compute this information manually with the count() function from the dplyr package.
diamonds %>%
count(cut)
## # A tibble: 5 × 2
## cut n
## <ord> <int>
## 1 Fair 1610
## 2 Good 4906
## 3 Very Good 12082
## 4 Premium 13791
## 5 Ideal 21551
Sometimes, you may want to map the heights of the bars not to counts, but to a variable in the data set. To do this, use geom_col(), which is short for column.
ggplot(data = pressure) +
geom_col(mapping = aes(x = temperature, y = pressure))

When you use geom_col(), your x and y values should have a one to one relationship, as they do in the pressure data set (i.e. each value of temperature is paired with a single value of pressure).
pressure
## # A tibble: 19 × 2
## temperature pressure
## <dbl> <dbl>
## 1 0 0.0002
## 2 20 0.0012
## 3 40 0.006
## 4 60 0.03
## 5 80 0.09
## 6 100 0.27
## 7 120 0.75
## 8 140 1.85
## 9 160 4.2
## 10 180 8.8
## 11 200 17.3
## 12 220 32.1
## 13 240 57
## 14 260 96
## 15 280 157
## 16 300 247
## 17 320 376
## 18 340 558
## 19 360 806
Use the code chunk below to plot the distribution of the color variable in the diamonds data set, which comes in the ggplot2 package.
ggplot(data = diamonds) +
geom_bar(aes(x = color))

#### Bar charts
What is the most common type of cut in the diamonds dataset?
Correct!
How many diamonds in the dataset had a Good cut?
Correct!
Diagnose the error below and then fix the code chunk to make a plot.
ggplot(data = pressure) +
geom_bar(mapping = aes(x = temperature, y = pressure))
ggplot(data = pressure) +
geom_col(mapping = aes(x = temperature, y = pressure))

Recreate the bar graph of color from exercise one, but this time first use count() to manually compute the heights of the bars. Then use geom_col() to plot the results as a bar graph. Does your graph look the same as in exercise one?
diamonds %>%
count(color) %>%
ggplot() +
geom_col(aes(x = color, y = n))

The following code creates a table of the counts.
diamonds %>%
count(color)
## # A tibble: 7 × 2
## color n
## <ord> <int>
## 1 D 6775
## 2 E 9797
## 3 F 9542
## 4 G 11292
## 5 H 8304
## 6 I 5422
## 7 J 2808
geom_bar() and geom_col() can use several aesthetics:
One of these, color, creates the most surprising results. Predict what the code below will return and then run it.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, color = cut))

The color aesthetic controls the outline of each bar in your bar plot, which may not be what you want. To color the interior of each bar, use the fill aesthetic:
Use the code chunk below to experiment with fill, along with other geom_bar() aesthetics, like alpha, linetype, and size.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))

ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut), alpha = 0.5)

ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut), width = 1)

ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut), width = 0.8)

ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut), width = 0.5)

Notice that width is a parameter, not an aesthetic mapping. Hence, you should set width outside of the aes() function.
Create a colored bar chart of the class variable from the mpg data set, which comes with ggplot2. Map the interior color of each bar to class.
ggplot(data = mpg) +
geom_bar(aes(class, fill = class))

If you map fill to a new variable, geom_bar() will display a stacked bar chart:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))

This plot displays 40 different combinations of cut and clarity, each displayed by its own rectangle. geom_bar() lays out the rectangles by stacking rectangles that have the same cut value on top of one another. You can change this behavior with a position adjustment.
To place rectangles that have the same cut value beside each other, set position = “dodge”.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
This plot shows the same rectangles as the previous chart; however, it lays out rectangles that have the same cut value beside each other.
To create the familiar stacked bar chart, set position = “stack” (which is the default for geom_bar()).
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack")

To expand each bar to take up the entire y axis, set position = “fill”. ggplot2 will stack the rectangles and then scale them within each bar.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
This makes it easy to compare proportions. For example, you can scan across the bars to see how the proportion of IF diamonds changes from cut to cut.
#### What is a position adjustment?
Every geom function in ggplot2 takes a position argument that is preset to a reasonable default. You can use position to determine how a geom should adjust objects that would otherwise overlap with each other.
For example, in our plot, each value of cut is associated with eight rectangles: one each for I1, SI2, SI1, VS2, VS1, VVS2, VVS1, and IF. Each of these eight rectangles belongs in the same place: directly above the value of cut that it is associated with, with the bottom of the rectangle placed at count = 0. But if we drew the plot like that, the rectangles would overlap each other.
Here’s what that would look like if you could peek around the side of the graph.
image
…and here’s what that would look like if you could see the graph from the front. You can make this plot by setting position = “identity”.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "identity")

Position adjustments tell ggplot2 how to re-distribute objects when they overlap. position = “identity” is the “adjustment” that lets objects overlap each other. It is a bad choice for bar graphs because the result looks like a stacked bar chart, even though it is not.
Use the code chunk to recreate the plot you see below. Remember: color is the name of a variable in diamonds (not to be confused with an aesthetic).
image
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = color, fill = clarity), position = "fill", width = 1)

Use the code chunk to recreate the plot you see below. Remember: color is the name of a variable in diamonds (not to be confused with an aesthetic).
image
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = color, fill = cut), position = "dodge")

#### Why is position = “identity” a bad idea?
Suppose the graph above uses position = “stack”. About how many diamonds have an ideal cut and a G color?
Correct!
In a stacked bar chart, you can calculate the number of observations in each bar by subtracting the y value at the bottom of the bar from the y value at the top.
Suppose the graph above uses position = “identity”. About how many diamonds have an ideal cut and a G color?
Correct!
Here the green bar extends all the way from 5000 to 0; most of the bar is behind the blue, purple, and magenta bars. In practice, you would never construct a bar chart like this.
You can more easily compare subgroups of data if you place each subgroup in its own subplot, a process known as facetting.
#### facet_grid()
ggplot2 provides two functions for facetting. facet_grid() divides the plot into a grid of subplots based on the values of one or two facetting variables. To use it, add facet_grid() to the end of your plot call.
The code chunks below show three ways to facet with facet_grid(). Spot the differences between the chunks, then run the code to learn what the differences do.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = color)) +
facet_grid(clarity ~ cut)

ggplot(data = diamonds) +
geom_bar(mapping = aes(x = color)) +
facet_grid(. ~ cut)

ggplot(data = diamonds) +
geom_bar(mapping = aes(x = color)) +
facet_grid(clarity ~ .)

As you saw in the code examples, you use facet_grid() by passing it a formula, the names of two variables connected by a ~.
facet_grid() will split the plot into facets vertically by the values of the first variable: each facet will contain the observations that have a common value of the variable. facet_grid() will split the plot horizontally by values of the second variable. The result is a grid of facets, where each specific subplot shows a specific combination of values.
If you do not wish to split on the vertical or horizontal dimension, pass facet_grid() a . instead of a variable name as a placeholder.
facet_wrap() provides a more relaxed way to facet a plot on a single variable. It will split the plot into subplots and then reorganize the subplots into multiple rows so that each plot has a more or less square aspect ratio. In short, facet_wrap() wraps the single row of subplots that you would get with facet_grid() into multiple rows.
To use facet_wrap() pass it a single variable name with a ~ before it, e.g. facet_wrap( ~ color).
Add facet_wrap() to the code below to create the graph that appeared at the start of this section. Facet on cut.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = color, fill = cut))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = color, fill = cut)) +
facet_wrap(~ cut)

By default, each facet in your plot will share the same x and y ranges. You can change this by adding a scales argument to facet_wrap() or facet_grid().
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = color, fill = cut)) +
facet_wrap( ~ cut, scales = "free_y")

In this tutorial, you learned how to make bar charts; but much of what you learned applies to other types of charts as well. Here’s what you should know:
Bar charts are an excellent way to display the distribution of a categorical variable. In the next tutorial, we’ll meet a set of geoms that display the distribution of a continuous variable.
Learn to make and customize histograms, a device for visualizing the distribution of continuous variables. Here, you will also learn to make similar plots like dotplots, densities, and frequency polygons.
Histograms are the most popular way to visualize continuous distributions. Here we will look at them and their derivatives. You will learn how to:
The tutorial is adapted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. You can purchase the book at shop.oreilly.com.
The tutorial uses the ggplot2 and dplyr packages, which have been pre-loaded for your convenience.
Video: https://vimeo.com/221607341
To make a histogram with ggplot2, add geom_histogram() to the ggplot2 template. For example, the code below plots a histogram of the carat variable in the diamonds dataset, which comes with ggplot2.
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As with geom_bar(), you do not need to give geom_histogram() a y variable. geom_histogram() will construct its own y variable by counting the number of observations that fall into each bin on the x axis. geom_histogram() will then map the counts to the y axis.
As a result, you can glance at a bar to determine how many observations fall within a bin. Bins with tall bars highlight common values of the x variable.
image
According to the chart, which is the most common carat size in the data?
Correct!
More than 15,000 diamonds in the data have a carat value in the bin between 0.3 and 0.4, more than in any other bin. How do we know? Because the bar above 0.3 to 0.4 rises to 15,000, higher than any other bar in the plot.
By default, ggplot2 will choose a binwidth for your histogram that results in about 30 bins. You can set the binwidth manually with the binwidth argument, which is interpreted in the units of the x axis:
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 1)

ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
#### bins
Alternatively, you can set the number of bins with the bins argument, which takes the total number of bins to use:
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), bins = 10)

ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), bins = 20)

It can be hard to determine what the actual binwidths are when you use bins, since they may not be round numbers.
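For instance, a rough estimate of the binwidth that bins = 10 implies is the range of the variable divided by the number of bins (ggplot2 expands the range slightly, so the true value differs a little):

```r
library(ggplot2)

# Approximate binwidth implied by bins = 10 for the carat variable
diff(range(diamonds$carat)) / 10
```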
You can move the bins left and right along the x axis with the boundary argument. boundary takes an x value to use as the boundary between two bins (ggplot2 will align the rest of the bins accordingly):
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), bins = 10, boundary = 0)

When you use geom_histogram(), you should always experiment with different binwidths because different size bins reveal different types of information.
To see an example of this, make a histogram of the carat variable in the diamonds dataset. Use a bin size of 0.5 carats. What does the overall shape of the distribution look like?
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

"Good job! The most common diamond size is about 0.5 carats. Larger sizes become progressively less frequent as carat size increases. This accords with general knowledge about diamonds, so you may be tempted to stop exploring the distribution of carat size. But should you?"
Recreate your histogram of carat but this time use a binwidth of 0.1. Does your plot reveal new information? Look closely. Is there more than one peak? Where do the peaks occur?
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.1)

"Good job! The new binwidth reveals a new phenomenon: carat sizes like 0.5, 0.75, 1, 1.5, and 2 are much more common than carat sizes that do not fall near a common fraction. Why might this be?"
Recreate your histogram of carat a final time, but this time use a binwidth of 0.01 and set the first boundary to zero. Try to find one new pattern in the results.
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.01, boundary = 0)

"Good job! The new binwidth reveals another phenomenon: each peak is very right skewed. In other words, diamonds that are 1.01 carats are much more common than diamonds that are 0.99 carats. Why would that be?"
Visually, histograms are very similar to bar charts. As a result, they use the same aesthetics: alpha, color, fill, linetype, and size.
They also behave in the same odd way when you use the color aesthetic. Do you remember what happens?
Which aesthetic would you use to color the interior fill of each bar in a histogram?
Correct!
For geoms with "substance", like bars, fill controls the color of the interior of the geom. Color controls the outline.
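For example, setting both aesthetics to constant values (colors chosen here just for illustration):

```r
library(ggplot2)

# fill colors the interior of each bar; color draws the outline
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5,
                 fill = "steelblue", color = "white")
```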
Recreate the histogram below.
image
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price, fill = cut), position = "stack", binwidth = 1000, boundary = 0)

"Good job! Did you ensure that each binwidth is 1000 and that the first boundary is zero?"
By adding a fill color to our histogram below, we’ve divided the data into five “sub-distributions”: the distribution of price for Fair cut diamonds, for Good cut diamonds, for Very Good cut diamonds, for Premium cut diamonds, and for Ideal cut diamonds.
But this display has some shortcomings:
We can improve the plot by using a different geom to display the distributions of price values. ggplot2 includes three geoms that display the same information as a histogram, but in different ways:
geom_freqpoly() plots a frequency polygon, which uses a line to display the same information as a histogram. You can think of a frequency polygon as a line that would connect the top of each bar that appears in a histogram, like this:
Note that the bars are not part of the frequency polygon; they are just there for reference. geom_freqpoly() recognizes the same parameters as geom_histogram(), such as bins, binwidth, and boundary.
Create the frequency polygon depicted above. It has a binwidth of 0.25 and starts at the boundary zero.
ggplot(data = diamonds) +
geom_freqpoly(mapping = aes(x = carat), binwidth = 0.25, boundary = 0)

"Good job! By using a line instead of bars, frequency polygons can sometimes do things that histograms cannot."
Use a frequency polygon to recreate our chart of price and cut. Since lines do not have “substance” like bars, you will want to use the color aesthetic instead of the fill aesthetic.
image
ggplot(data = diamonds) +
geom_freqpoly(mapping = aes(x = price, color = cut), binwidth = 1000, boundary = 0)

"Good job! Since lines do not occlude each other, `geom_freqpoly()` plots each sub-group against the same baseline: y = 0 (i.e. it unstacks the sub-groups). This makes it easier to compare the distributions. You can now see that for almost every price value, there are more Ideal cut diamonds than there are other types of diamonds."
Our frequency polygon highlights a second shortcoming with our graph: it is difficult to compare the shapes of the distributions because some sub-groups contain more diamonds than others. This compresses smaller subgroups into the bottom of the graph.
You can avoid this with geom_density().
geom_density() plots a kernel density estimate (i.e. a density curve) for each distribution. This is a smoothed representation of the data, analogous to a smoothed histogram.
Density curves do not plot count on the y axis but density, which is analogous to the count divided by the total number of observations. Densities make it easy to compare the distributions of sub-groups: when you plot multiple sub-groups, each density curve will contain the same area under its curve.
image
#### Exercise 8 - Density curves
Re-draw our plot with density curves. How do you interpret the results?
image
ggplot(data = diamonds) +
geom_density(mapping = aes(x = price, color = cut))

"Good job! You can now compare the most common prices for each sub-group. For example, the plot shows that the most common price for most diamonds is near $1000. However, the most common price for Fair cut diamonds is significantly higher, about $2500. We will come back to this oddity in a later tutorial."
Density plots do not take bins, binwidth, and boundary parameters. Instead they recognize kernel and smoothing parameters that are used in the density fitting algorithm, which is fairly sophisticated.
In practice, you can obtain useful results quickly with the default parameters of geom_density(). If you’d like to learn more about density estimates and their tuning parameters, begin with the help page at ?geom_density().
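As one sketch of such tuning, geom_density()'s adjust argument scales the default bandwidth: values below 1 give a wigglier curve, values above 1 a smoother one.

```r
library(ggplot2)

# Halve the default bandwidth to reveal finer structure in carat sizes
ggplot(data = diamonds) +
  geom_density(mapping = aes(x = carat), adjust = 1/2)
```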
ggplot2 provides a final geom for displaying one dimensional distributions: geom_dotplot(). geom_dotplot() represents each observation with a dot and then stacks dots within bins to create the semblance of a histogram.
Dotplots can provide an intuitive display of the data, but they have several shortcomings. Dotplots are not ideal for large data sets like diamonds, and provide meaningless y axis labels. I also find that the tuning parameters of geom_dotplot() make dotplots too slow to work with for EDA.
ggplot(data = mpg) +
geom_dotplot(mapping = aes(x = displ), dotsize = 0.5, stackdir = "up", stackratio = 1.1)
## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.
#### Exercise 9 - Facets
Instead of changing geoms, you can make the sub-groups in our original plot easier to compare by facetting the data. Extend the code below to facet by cut.
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price, fill = cut), binwidth = 1000, boundary = 0) +
facet_wrap(~cut)

"Good job! Facets make it easier to compare sub-groups, but at the expense of separating the data. You may decide that frequency polygons and densities allow more direct comparisons."
In this tutorial, you learned how to visualize distributions with histograms, frequency polygons, and densities. But what should you look for in these visualizations?
Look for places with lots of data. Tall bars reveal the most common values in your data; you can expect these values to be the “typical values” for your variable.
Look for places with little data. Short bars reveal uncommon values. These values appear rarely and you might be able to figure out why.
Look for outliers. Bars that appear away from the bulk of the data are outliers, special cases that may reveal unexpected insights.
Sometimes outliers cannot be seen in a plot, but can be inferred from the range of the x axis. For example, many of the plots in this tutorial seemed to extend well past the end of the data. Why? Because the range was stretched to include outliers. When your data set is large, like diamonds, the bar that describes an outlier may be invisible (i.e. less than one pixel high).
Look for clusters. Clusters of similar values suggest that sub-groups exist in your data.
Look for shape. The shape of a histogram can often indicate whether or not a variable behaves according to a known probability distribution.
The most important thing to remember about histograms, frequency polygons, and dotplots is to explore different binwidths. The binwidth of a histogram determines what information the histogram displays. You cannot predict ahead of time which binwidth will reveal unexpected information.
Here you will learn to make and customize boxplots, a chart type that makes it easy to visualize the relationship between continuous and categorical variables. You will also learn to visualize the relationship between two categorical variables with a counts plot.
Boxplots display the relationship between a continuous variable and a categorical variable. Count plots display the relationship between two categorical variables. In this tutorial, you will learn how to use both. You will learn how to:
The tutorial is adapted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. You can purchase the book at shop.oreilly.com.
The tutorial uses the ggplot2 and dplyr packages, which have been pre-loaded for your convenience.
Video: https://vimeo.com/222358034
Which of the sub-plots accurately describes the data above with a boxplot?
Correct!
To make a boxplot with ggplot2, add geom_boxplot() to the ggplot2 template. For example, the code below uses boxplots to display the relationship between the class and hwy variables in the mpg dataset, which comes with ggplot2.
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = class, y = hwy))
#### Categorical and continuous
geom_boxplot() expects the y axis to be continuous, but accepts categorical variables on the x axis. For example, here class is categorical. geom_boxplot() will automatically plot a separate boxplot for each value of x. This makes it easy to compare the distributions of points with different values of x.
#### Exercise 2 - Interpretation
image
Which class of car has the lowest median highway fuel efficiency (hwy value)?
Correct!
Recreate the boxplot below with the diamonds data set.
image
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = cut, y = price))

"Do you notice how many outliers appear in the plot? The boxplot algorithm can identify many outliers if your data is big, perhaps too many. Let's look at ways to suppress the appearance of outliers in your plot."
You can change how outliers look in your boxplot with the parameters outlier.color, outlier.fill, outlier.shape, outlier.size, outlier.stroke, and outlier.alpha (outlier.shape takes a number from 0 to 25).
Unfortunately, you can’t tell geom_boxplot() to ignore outliers completely, but you can make outliers disappear by setting outlier.alpha = 0. Try it in the plot below.
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = cut, y = price), outlier.shape = 24,
outlier.fill = "white", outlier.stroke = 0.25)

ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = cut, y = price), outlier.shape = 24,
outlier.fill = "white", outlier.stroke = 0.25, outlier.alpha = 0)

Boxplots recognize the following aesthetics: alpha, color, fill, group, linetype, shape, size, and weight.
Of these, group can be the most useful. Consider the plot below. It uses a continuous variable on the x axis. As a result, geom_boxplot() is not sure how to split the data into categories: it lumps all of the data into a single boxplot. The result reveals little about the relationship between carat and price.
In the next sections, we’ll use group to make a more informative plot.
ggplot2 provides three helper functions that you can use to split a continuous variable into categories. Each takes a continuous vector and returns a categorical vector that assigns each value to a group. For example, cut_interval() bins a vector into n equal length bins.
continuous_vector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
continuous_vector
## [1] 1 2 3 4 5 6 7 8 9 10
cut_interval(continuous_vector, n = 3)
## [1] [1,4] [1,4] [1,4] [1,4] (4,7] (4,7] (4,7] (7,10] (7,10] (7,10]
## Levels: [1,4] (4,7] (7,10]
The three cut functions are cut_interval(), which makes n groups with equal ranges; cut_number(), which makes n groups with roughly equal numbers of observations; and cut_width(), which makes groups of a given width.
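For comparison, here is cut_number() applied to the same vector; it splits the values into n groups that each contain roughly the same number of observations:

```r
library(ggplot2)

continuous_vector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# Two groups, each containing about half of the observations
cut_number(continuous_vector, n = 2)
```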
Use one of three functions below to bin continuous_vector into groups of width = 2.
cut_width(continuous_vector, width = 2)
## [1] [1,3] [1,3] [1,3] (3,5] (3,5] (5,7] (5,7] (7,9] (7,9] (9,11]
## Levels: [1,3] (3,5] (5,7] (7,9] (9,11]
"Good job! Now let's apply the cut functions to our graph."
When you set the group aesthetic of a boxplot, geom_boxplot() will draw a separate boxplot for each collection of observations that have the same value of whichever vector you map to group.
This means we can split our carat plot by mapping group to the output of a cut function, as in the code below. Study the code, then modify it to create a separate boxplot for each 0.25 wide interval of carat.
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = carat, y = price, group = cut_interval(carat, n = 2)))

ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = carat, y = price, group = cut_width(carat, width = 0.25)))

"Good job! You can now see a relationship between price and carat. You could also make a scatterplot of these variables, but in this case, it would be a black mass of 54,000 data points."
geom_boxplot() always expects the categorical variable to appear on the x axis, which creates vertical boxplots. But what if you’d like to make horizontal boxplots, like in the plot below?
image
Extend the code below to orient the boxplots horizontally.
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = class, y = hwy))

ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = class, y = hwy)) +
coord_flip()

"Good job! `coord_flip()` is an example of a new coordinate system. You'll learn much more about ggplot2 coordinate systems in a later tutorial."
Boxplots provide a quick way to represent a distribution, but they leave behind a lot of information. ggplot2 supplements boxplots with two geoms that show more information.
The first is geom_dotplot(). If you set the binaxis parameter of geom_dotplot() to “y”, geom_dotplot() behaves like geom_boxplot(), displaying a separate distribution for each group of data.
Here each group functions like a vertical histogram. Add the parameter stackdir = “center” then re-run the code. Can you interpret the results?
ggplot(data = mpg) +
geom_dotplot(mapping = aes(x = class, y = hwy), binaxis = "y",
dotsize = 0.5, binwidth = 1)

ggplot(data = mpg) +
geom_dotplot(mapping = aes(x = class, y = hwy), binaxis = "y",
dotsize = 0.5, binwidth = 1, stackdir = "center")

'Good job! When you set `stackdir = "center"`, `geom_dotplot()` arranges each row of dots symmetrically around the $x$ value. This layout will help you understand the next geom. As in the histogram tutorial, it takes a lot of tweaking to make a dotplot look right. As a result, I tend to only use them when I want to make a point.'
geom_violin() provides a second alternative to geom_boxplot(). A violin plot uses densities to draw a smoothed version of the centered dotplot you just made.
You can think of a violin plot as an outline drawn around the edges of a centered dotplot. Each “violin” spans the range of the data. The violin is thick where there are many values, and thin where there are few.
Convert the plot below from a boxplot to a violin plot. Note that violin plots do not use the parameters you saw for dotplots.
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = class, y = hwy))

ggplot(data = mpg) +
geom_violin(mapping = aes(x = class, y = hwy))

'Good job! Another way to interpret a violin plot is to mentally "push" the width of each violin all to one side (so the other side is a straight line). The result would be a density (e.g. `geom_density()`) turned on its side for each distribution.'
You can further enhance violin plots by adding the parameter draw_quantiles = c(0.25, 0.5, 0.75). This will cause ggplot2 to draw horizontal lines across the violins at the 25th, 50th, and 75th percentiles. These are the same three horizontal lines that are displayed in a boxplot (the 25th and 75th percentiles are the bounds of the box, the 50th percentile is the median).
Add these lines to the violin plot below.
ggplot(data = mpg) +
geom_violin(mapping = aes(x = class, y = hwy))

ggplot(data = mpg) +
geom_violin(mapping = aes(x = class, y = hwy), draw_quantiles = c(0.25, 0.5, 0.75))

Boxplots provide an efficient way to explore the interaction of a continuous variable and a categorical variable. But what if you have two categorical variables?
You can see how observations are distributed across two categorical variables with geom_count(). geom_count() draws a point at each combination of values from the two variables. The size of the point is mapped to the number of observations with this combination of values. Rare combinations will have small points, frequent combinations will have large points.
image
ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = clarity))

You can use the count() function in the dplyr package to compute the count values displayed by geom_count(). To use count(), pass it a data frame and then the names of zero or more variables in the data frame. count() will return a new table that lists how many observations occur with each possible combination of the listed variables.
So for example, the code below returns the counts that you visualized in Exercise 8.
diamonds %>%
count(cut, clarity)
## # A tibble: 40 × 3
## cut clarity n
## <ord> <ord> <int>
## 1 Fair I1 210
## 2 Fair SI2 466
## 3 Fair SI1 408
## 4 Fair VS2 261
## 5 Fair VS1 170
## 6 Fair VVS2 69
## 7 Fair VVS1 17
## 8 Fair IF 9
## 9 Good I1 96
## 10 Good SI2 1081
## # … with 30 more rows
Heat maps provide a second way to visualize the relationship between two categorical variables. They work like count plots, but use a fill color instead of a point size to display the number of observations in each combination.
ggplot2 does not provide a geom function for heat maps, but you can construct a heat map by plotting the results of count() with geom_tile().
To do this, set the x and y aesthetics of geom_tile() to the variables that you pass to count(). Then map the fill aesthetic to the n variable computed by count(). The plot below displays the same counts as the plot in Exercise 8.
diamonds %>%
count(cut, clarity) %>%
ggplot() +
geom_tile(mapping = aes(x = cut, y = clarity, fill = n))

Practice the method above by re-creating the heat map below.
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
#### Recap
Boxplots, dotplots and violin plots provide an easy way to look for relationships between a continuous variable and a categorical variable. Violin plots convey a lot of information quickly, but boxplots have a head start in popularity — they were easy to use when statisticians had to draw graphs by hand.
In any of these graphs, look for distributions, ranges, medians, skewness or anything else that catches your eye to change in an unusual way from distribution to distribution. Often, you can make patterns even more revealing with the fct_reorder() function from the forcats package (we’ll wait to learn about forcats until after you study factors).
Count plots and heat maps help you see how observations are distributed across the interactions of two categorical variables.
This tutorial revisits scatterplots, which display the relationship between two continuous variables. Along the way, you will learn to build multi-layer plots and to use new coordinate systems.
A scatterplot displays the relationship between two continuous variables. Scatterplots are one of the most common types of graphs—in fact, you’ve met scatterplots already in Visualization Basics.
In this tutorial, you’ll learn how to:
The tutorial is adapted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. You can purchase the book at shop.oreilly.com.
The tutorial uses the ggplot2, ggrepel, and dplyr packages, which have been pre-loaded for your convenience.
In Visualization Basics, you learned how to make a scatterplot with geom_point().
The code below summarises the mpg data set and begins to plot the results. Finish the plot with geom_point(). Put mean_cty on the x axis and mean_hwy on the y axis.
mpg %>%
group_by(class) %>%
summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>%
ggplot()
mpg %>%
group_by(class) %>%
summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>%
ggplot() +
geom_point(mapping = aes(x = mean_cty, y = mean_hwy))

"Good job! It can be tricky to remember when to use %>% and when to use +. Use %>% to add one complete step to a pipe of code. Use + to add one more line to a ggplot2 call."
geom_text() and geom_label() create scatterplots that use words instead of points to display data. Each requires the extra aesthetic label, which you should map to a variable that contains text to display for each observation.
Convert the plot below from geom_point() to geom_text() and map the label aesthetic to the class variable. When you are finished convert the code to geom_label() and rerun the plot. Can you spot the difference?
mpg %>%
group_by(class) %>%
summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>%
ggplot() +
geom_point(mapping = aes(x = mean_cty, y = mean_hwy))

mpg %>%
group_by(class) %>%
summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>%
ggplot() +
geom_text(mapping = aes(x = mean_cty, y = mean_hwy, label = class))

mpg %>%
group_by(class) %>%
summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>%
ggplot() +
geom_label(mapping = aes(x = mean_cty, y = mean_hwy, label = class))

"Good job! geom_text() replaces each point with a piece of text supplied by the label aesthetic. geom_label() replaces each point with a text box. Notice that some pieces of text overlap each other, and others run off the page. We'll soon look at a way to fix this."
In Visualization Basics, you met geom_smooth(), which provides a summarised version of a scatterplot.
geom_smooth() uses a model to fit a smoothed line to the data and then visualizes the results. By default, geom_smooth() fits a loess smooth to data sets with less than 1,000 observations, and a generalized additive model to data sets with more than 1,000 observations.

You can use the method parameter of geom_smooth() to fit and display other types of model lines. To do this, pass method the name of an R modeling function for geom_smooth() to use, such as lm (for linear models) or glm (for generalized linear models).
In the code below, use geom_smooth() to draw the linear model line that fits the data.
mpg %>%
group_by(class) %>%
summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>%
ggplot()
mpg %>%
group_by(class) %>%
summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>%
ggplot() +
geom_smooth(mapping = aes(x = mean_cty, y = mean_hwy), method = lm)
## `geom_smooth()` using formula = 'y ~ x'

"Good job! Now let's look at a way to make geom_smooth() much more useful."
geom_smooth() becomes much more useful when you combine it with geom_point() to create a scatterplot that contains both:
In ggplot2, you can add multiple geoms to a plot by adding multiple geom functions to the plot call. For example, the code below creates a plot that contains both points and a smooth line. Imagine what the results will look like in your head, and then run the code to see if you are right.
mpg %>%
group_by(class) %>%
summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>%
ggplot() +
geom_point(mapping = aes(x = mean_cty, y = mean_hwy)) +
geom_smooth(mapping = aes(x = mean_cty, y = mean_hwy), method = lm)
## `geom_smooth()` using formula = 'y ~ x'

"Good job! You can add as many geom functions as you like to a plot; but, in practice, a plot will become hard to interpret if it contains more than two or three geoms."
Do you remember how the labels that we made earlier overlapped each other and ran off our graph? The geom_label_repel() geom from the ggrepel package mitigates these problems by using an algorithm to arrange labels within a plot. It works best in conjunction with a layer of points that displays the true location of each observation.
Use geom_label_repel() to add a new layer to our plot below. geom_label_repel() requires the same aesthetics as geom_label(): x, y, and label (here set to class).
mpg %>%
group_by(class) %>%
summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>%
ggplot() +
geom_point(mapping = aes(x = mean_cty, y = mean_hwy)) +
geom_smooth(mapping = aes(x = mean_cty, y = mean_hwy), method = lm)
## `geom_smooth()` using formula = 'y ~ x'

library(ggrepel)
mpg %>%
group_by(class) %>%
summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>%
ggplot() +
geom_point(mapping = aes(x = mean_cty, y = mean_hwy)) +
geom_smooth(mapping = aes(x = mean_cty, y = mean_hwy), method = lm) +
geom_label_repel(mapping = aes(x = mean_cty, y = mean_hwy, label = class))
## `geom_smooth()` using formula = 'y ~ x'

mpg %>%
group_by(class) %>%
summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>%
ggplot() +
geom_point(mapping = aes(x = mean_cty, y = mean_hwy)) +
geom_smooth(mapping = aes(x = mean_cty, y = mean_hwy), method = lm) +
geom_text_repel(mapping = aes(x = mean_cty, y = mean_hwy, label = class))
## `geom_smooth()` using formula = 'y ~ x'

"Good job! The ggrepel package also provides geom_text_repel(), which is an analog for geom_text()."
If you study the solution for the previous exercise, you’ll notice a fair amount of duplication. We set the same aesthetic mappings in three different places.
mpg %>%
group_by(class) %>%
summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>%
ggplot() +
geom_point(mapping = aes(x = mean_cty, y = mean_hwy)) +
geom_smooth(mapping = aes(x = mean_cty, y = mean_hwy), method = lm) +
geom_label_repel(mapping = aes(x = mean_cty, y = mean_hwy, label = class))
## `geom_smooth()` using formula = 'y ~ x'

You should try to avoid duplication whenever you can in code because duplicated code invites typos, is hard to update, and takes longer than needed to write. Thankfully, ggplot2 provides a way to avoid duplication across multiple layers.
You can set aesthetic mappings in two places within any ggplot2 call. You can set the mappings inside of a geom function, as we’ve been doing. Or you can set the mappings inside of the ggplot() function like below:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point()

ggplot2 will treat any mappings set in the ggplot() function as global mappings. Each layer in the plot will inherit and use these mappings.
ggplot2 will treat any mappings set in a geom function as local mappings. Only the local layer will use these mappings. The mappings will override the global mappings if the two conflict, or add to them if they do not.
This system creates an efficient way to write plot calls:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth(mapping = aes(color = class), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 5.6935
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.5065
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 0.65044
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 4.008
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.708
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1.6135e-17
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 0.25

Reduce duplication in the code below by moving as many local mappings into the global mappings as possible. Rerun the new code to ensure that it creates the same plot.
mpg %>%
group_by(class) %>%
summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>%
ggplot() +
geom_point(mapping = aes(x = mean_cty, y = mean_hwy)) +
geom_smooth(mapping = aes(x = mean_cty, y = mean_hwy), method = lm) +
geom_label_repel(mapping = aes(x = mean_cty, y = mean_hwy, label = class))
## `geom_smooth()` using formula = 'y ~ x'

mpg %>%
group_by(class) %>%
summarise(mean_cty = mean(cty), mean_hwy = mean(hwy)) %>%
ggplot(mapping = aes(x = mean_cty, y = mean_hwy)) +
geom_point() +
geom_smooth(method = lm) +
geom_label_repel(mapping = aes(label = class))
## `geom_smooth()` using formula = 'y ~ x'
#### Exercise 3 - Global vs. Local
Recreate the plot below in the most efficient way possible.
image
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

"Good Job!"
The data argument also follows a global vs. local system. If you set the data argument of a geom function, the geom will use the data you supply instead of the data contained in ggplot(). This is a convenient way to highlight groups of points.
Use data arguments to recreate the plot below. I’ve started the code for you.
image
mpg2 <- filter(mpg, class == "2seater")
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_point(data = mpg2, color = "red", size = 2)

"Good Job!"
Use data arguments to recreate the plot below.
image
mpg3 <- filter(mpg, hwy > 40)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_label_repel(data = mpg3, mapping = aes(label = class))

When exploring data, you’ll often make a plot and then think of a way to improve it. Instead of starting from scratch or copying and pasting your code, you can use ggplot2’s last_plot() function. last_plot() returns the most recent plot call, which makes it easy to build up a plot one layer at a time.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point()

last_plot() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

last_plot() +
geom_smooth(method = lm, color = "purple")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
#### Saving plots
If you’d like to work with a plot later, you can save it to an R object. Later you can display the plot or add to it, as if you were using last_plot().
p <- ggplot(data = mpg) +
geom_point(mapping = aes(x = cty, y = hwy))
Notice that ggplot2 will not display a plot when you save it. It waits until you call the saved object.
p

geom_rug() adds another type of summary to a plot. It displays the one-dimensional marginal distributions of each variable in the scatterplot. These appear as collections of tickmarks along the x and y axes.
In the chunk below, use the faithful dataset to create a scatterplot that has the waiting variable on the x axis and the eruptions variable on the y axis. Use geom_rug() to add a rug plot to the scatterplot. Like geom_point(), geom_rug() requires x and y aesthetic mappings.
ggplot(data = faithful, mapping = aes(x = waiting, y = eruptions)) +
geom_point() +
geom_rug()

geom_jitter() plots a scatterplot and then adds a small amount of random noise to each point in the plot. It is a shortcut for adding a “jitter” position adjustment to a points plot (i.e., geom_point(position = "jitter")).
Why would you use geom_jitter()? Jittering provides a simple way to inspect patterns that occur in heavily gridded or overlapping data. To see what I mean, replace geom_point() with geom_jitter() in the plot below.
ggplot(data = mpg) +
geom_point(mapping = aes(x = class, y = hwy))

ggplot(data = mpg) +
geom_jitter(mapping = aes(x = class, y = hwy))

"Good job! You can also jitter in only a single direction. To turn off jittering in the x direction set width = 0 in geom_jitter(). To turn off jittering in the y direction, set height = 0."
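As a quick sketch of that tip (the width value here is an arbitrary choice), the chunk below jitters the points only in the x direction by setting height = 0:

```r
library(ggplot2)

# Jitter horizontally only: height = 0 turns off vertical jitter,
# so each point keeps its exact hwy value while the class columns
# spread apart.
p <- ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = class, y = hwy), width = 0.2, height = 0)
p
```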
geom_jitter() provides a convenient way to overlay raw data on boxplots, which display summary information.
Use the chunk below to create a boxplot of the previous graph. Arrange for the outliers to have an alpha of 0, which will make them completely transparent. Then add a layer of points that are jittered in the y direction, but not the x direction.
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot(outlier.alpha = 0) +
geom_jitter(width = 0)

One way to customize a scatterplot is to plot it in a new coordinate system. ggplot2 provides several helper functions that change the coordinate system of a plot. You’ve already seen one of these in action in the boxplots tutorial: coord_flip() flips the x and y axes of a plot.
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot(outlier.alpha = 0) +
geom_jitter(width = 0) +
coord_flip()
#### The coord functions
Altogether, ggplot2 comes with seven coord functions: coord_cartesian(), coord_fixed(), coord_flip(), coord_map(), coord_polar(), coord_quickmap(), and coord_trans().
By default, ggplot2 will draw a plot in Cartesian coordinates unless you add one of the functions above to the plot code.
You use each coord function like you use coord_flip(), by adding it to a ggplot2 call.
So for example, you could add coord_polar() to a plot to make a graph that uses polar coordinates.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut), width = 1)

last_plot() +
coord_polar()

How can a coordinate system improve a scatterplot?
Consider the scatterplot below. It shows a strong relationship between the carat size of a diamond and its price.
However, the relationship does not appear linear. It appears to have the form \(y = x^n\), a common relationship found in nature. You can estimate n by replotting the data in a log-log plot.
Log-log plots graph the log of x vs. the log of y, which has a valuable visual effect. If you log both sides of a relationship like
\[y = x^n\]
You get a linear relationship with slope n:
\[\log(y) = \log(x^n)\] \[\log(y)= n\cdot \log(x)\]
In other words, log-log plots unbend power relationships into straight lines. Moreover, they display n as the slope of the straight line, which is reasonably easy to estimate.
Try this by using the diamonds dataset to plot log(carat) against log(price).
ggplot(data = diamonds) +
geom_point(mapping = aes(x = log(price), y = log(carat)))
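As a numeric companion to the visual estimate (a sketch, not part of the original exercise), you can fit a linear model to the logged variables; the slope of the fit approximates the exponent n:

```r
library(ggplot2)  # provides the diamonds dataset

# On the log-log scale, price ~ carat^n becomes a straight line
# with slope n, so a linear fit recovers an estimate of n.
fit <- lm(log(price) ~ log(carat), data = diamonds)
slope <- coef(fit)[["log(carat)"]]
slope
```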
#### coord_trans()
coord_trans() provides a second way to do the same transformation, or similar transformations.
To use coord_trans() give it an x and/or a y argument. Set each to the name of an R function surrounded by quotation marks. coord_trans() will use the function to transform the specified axis before plotting the raw data.
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price)) +
coord_trans(x = "log", y = "log")

Scatterplots are one of the most useful types of plots for data science. You will have many chances to use geom_point(), geom_smooth(), and geom_label_repel() in your day to day work.
However, this tutorial introduced two important concepts that apply to more than just scatterplots:
Learn to connect data points to make line plots, polygon plots, and even maps.
A line graph displays a functional relationship between two continuous variables. A map displays spatial data. The two may seem different, but they are made in similar ways. This tutorial will examine them both.
In this tutorial, you’ll learn how to:
The tutorial is adapted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. You can purchase the book at shop.oreilly.com.
The tutorial uses the ggplot2, maps, mapproj, and dplyr packages, which have been pre-loaded for your convenience.
Like scatterplots, line graphs display the relationship between two continuous variables. However, unlike scatterplots, line graphs expect the variables to have a functional relationship, where each value of x is associated with only one value of y.
For example, in the plot below, there is only one value of unemploy for each value of date.
#### geom_line()
Use the geom_line() function to make line graphs. Like geom_point(), it requires x and y aesthetics.
Use geom_line() in the chunk below to recreate the graph above. The graph uses the economics dataset that comes with ggplot2 and maps the date and unemploy variables to the x and y axes. See Visualization Basics if you are completely stuck.
ggplot(data = economics) +
geom_line(mapping = aes(x = date, y = unemploy))

"Good Job! The graph shows the number of unemployed people in the US (in thousands) from 1967 to 2015. Now let's look at a richer dataset."
I’ve used the gapminder package to assemble a new data set named asia to plot. Among other things, asia contains the per capita GDP of four countries from 1952 to 2007.
The following code uses gapminder package: https://CRAN.R-project.org/package=gapminder
library(gapminder)
gapminder
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # … with 1,694 more rows
unique(filter(gapminder, continent == "Asia")$country)
## [1] Afghanistan Bahrain Bangladesh Cambodia
## [5] China Hong Kong, China India Indonesia
## [9] Iran Iraq Israel Japan
## [13] Jordan Korea, Dem. Rep. Korea, Rep. Kuwait
## [17] Lebanon Malaysia Mongolia Myanmar
## [21] Nepal Oman Pakistan Philippines
## [25] Saudi Arabia Singapore Sri Lanka Syria
## [29] Taiwan Thailand Vietnam West Bank and Gaza
## [33] Yemen, Rep.
## 142 Levels: Afghanistan Albania Algeria Angola Argentina Australia ... Zimbabwe
Asia <- filter(gapminder, country %in% c("China", "Japan", "Korea, Dem. Rep.", "Korea, Rep."))
asia <- Asia %>%
mutate(country = case_when(country =="Korea, Dem. Rep." ~ "North Korea",
country == "Korea, Rep." ~ "South Korea",
TRUE ~ as.character(country)))
asia
## # A tibble: 48 × 6
## country continent year lifeExp pop gdpPercap
## <chr> <fct> <int> <dbl> <int> <dbl>
## 1 China Asia 1952 44 556263527 400.
## 2 China Asia 1957 50.5 637408000 576.
## 3 China Asia 1962 44.5 665770000 488.
## 4 China Asia 1967 58.4 754550000 613.
## 5 China Asia 1972 63.1 862030000 677.
## 6 China Asia 1977 64.0 943455000 741.
## 7 China Asia 1982 65.5 1000281000 962.
## 8 China Asia 1987 67.3 1084035000 1379.
## 9 China Asia 1992 68.7 1164970000 1656.
## 10 China Asia 1997 70.4 1230075000 2289.
## # … with 38 more rows
However, when we plot the asia data we get an odd-looking graph. The line seems to “whipsaw” up and down. Whipsawing is one of the most commonly encountered challenges with line graphs.
ggplot(asia) +
geom_line(mapping = aes(x = year, y = gdpPercap))

You’ve encountered whipsawing before in the Data Basics tutorial. What does whipsawing indicate?
Correct!
As a result, our single line needs to connect multiple points for each x value before moving to the next x value.
ggplot(asia) +
geom_line(mapping = aes(x = year, y = gdpPercap))

"Good job! There are actually four lines in the plot. One for each country: China, Japan, North Korea, and South Korea."
Many geoms, like lines, boxplots, and smooth lines, use a single object to display the entire dataset. You can use the group aesthetic to instruct these geoms to draw separate objects for different groups of observations.
For example, in the code below, you can map group to the grouping variable country to create a separate line for each country. Try it. Be sure to place the group mapping inside of aes().
ggplot(asia) +
geom_line(mapping = aes(x = year, y = gdpPercap))

ggplot(asia) +
geom_line(mapping = aes(x = year, y = gdpPercap, group = country))

"Good job! We now have a separate line for each country. Unfortunately, we cannot tell what the countries are: the group aesthetic does not supply a legend. Let's look at how to fix that."
You do not have to rely on the group aesthetic to perform a grouping. ggplot2 will automatically group a monolithic geom whenever you map an aesthetic to a categorical variable.
So for example, the code below performs an implied grouping. And since we use the color aesthetic, the plot includes the color legend.
ggplot(asia) +
geom_line(mapping = aes(x = year, y = gdpPercap, color = country))
#### linetype
Lines recognize a useful aesthetic that we haven’t encountered before, linetype. Change color to linetype below and inspect the results. What happens if you map both a color and a linetype to country?
ggplot(asia) +
geom_line(mapping = aes(x = year, y = gdpPercap, color = country))

ggplot(asia) +
geom_line(mapping = aes(x = year, y = gdpPercap, color = country, linetype = country))

"Good job! If you map two aesthetics to the same variable, ggplot2 will combine their legends. Supplementing color with linetype is a good idea if you might print your line chart in black and white."
Use what you’ve learned to plot the life expectancy of each country over time. Life expectancy is saved in the asia data set as lifeExp. Which country has the highest life expectancy? The lowest?
ggplot(asia) +
geom_line(mapping = aes(x = year, y = lifeExp, color = country, linetype = country))

geom_step() draws a line chart in a stepwise fashion. To see what I mean, change the geom in the plot below and rerun the code.
ggplot(asia) +
geom_line(mapping = aes(x = year, y = lifeExp, color = country, linetype = country))

ggplot(asia) +
geom_step(mapping = aes(x = year, y = lifeExp, color = country, linetype = country))

'Good job! You can control whether the steps move horizontally first and then vertically or vertically first and then horizontally with the parameters `direction = "hv"` (the default) or `direction = "vh"`.'
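To see the difference the direction parameter makes, here is a minimal sketch using the economics dataset (which ships with ggplot2), drawing the steps vertically first:

```r
library(ggplot2)

# direction = "vh" steps vertically first, then horizontally;
# the default, "hv", does the opposite.
p_step <- ggplot(economics) +
  geom_step(mapping = aes(x = date, y = unemploy), direction = "vh")
p_step
```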
geom_area() is similar to a line graph, but it fills in the area under the line. To see geom_area() in action, change the geom in the plot below and rerun the code.
ggplot(economics) +
geom_line(mapping = aes(x = date, y = unemploy))

ggplot(economics) +
geom_area(mapping = aes(x = date, y = unemploy))

Do you recall from Visualization Basics how you would set the fill of our plot to blue (instead of, say, mapping the fill to a variable)? Give it a try.
ggplot(economics) +
geom_area(mapping = aes(x = date, y = unemploy))

ggplot(economics) +
geom_area(mapping = aes(x = date, y = unemploy), fill = "blue")

geom_area() is a great choice if your measurements represent the accumulation of objects (like unemployed people). Notice that the y axis of geom_area() always begins or ends at zero.
Perhaps because of this, geom_area() can be quirky when you have multiple groups. Run the code below. Can you tell what happens here?
ggplot(asia) +
geom_area(mapping = aes(x = year, y = lifeExp, fill = country))

If you answered that people in China were living to be 300 years old, you guessed wrong.
geom_area() is stacking each group above the group below. As a result, the line that should display the life expectancy for China displays the combined life expectancy for all countries.
You can fix this by changing the position adjustment for geom_area(). Give it a try below. Change the position parameter from “stack” (the implied default) to “identity”. See Bar Charts if you’d like to learn more about position adjustments.
ggplot(asia) +
geom_area(mapping = aes(x = year, y = lifeExp, fill = country), alpha = 0.3)

ggplot(asia) +
geom_area(mapping = aes(x = year, y = lifeExp, fill = country), position = "identity", alpha = 0.3)

"Good Job! You can further customize your graph by switching from `geom_area()` to `geom_ribbon()`. `geom_ribbon()` lets you map the bottom of the filled area to a variable, as well as the top. See `?geom_ribbon` if you'd like to learn more."
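For instance, a minimal geom_ribbon() sketch (the band half-width of 500 is an arbitrary choice for illustration) draws a filled band around the unemployment series by mapping both ymin and ymax:

```r
library(ggplot2)

# geom_ribbon() fills the region between ymin and ymax; here the
# band spans unemploy - 500 to unemploy + 500 (in thousands).
p_ribbon <- ggplot(economics) +
  geom_ribbon(mapping = aes(x = date,
                            ymin = unemploy - 500,
                            ymax = unemploy + 500),
              fill = "grey70") +
  geom_line(mapping = aes(x = date, y = unemploy))
p_ribbon
```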
geom_line() comes with a strange bed-fellow, geom_path(). geom_path() draws a line between points like geom_line(), but instead of connecting points in the order that they appear along the x axis, geom_path() connects the points in the order that they appear in the data set.
It starts with the observation in row one of the data and connects it to the observation in row two, which it then connects to the observation in row three, and so on.
To see how geom_path() does this, let’s rearrange the rows in the economics dataset. We can reorder them by unemploy value. Now the data set will begin with the observation that had the lowest value of unemploy.
economics2 <- economics %>%
arrange(unemploy)
economics2
## # A tibble: 574 × 6
## date pce pop psavert uempmed unemploy
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1968-12-01 576. 201621 11.1 4.4 2685
## 2 1968-09-01 568. 201095 10.6 4.6 2686
## 3 1968-10-01 572. 201290 10.8 4.8 2689
## 4 1969-02-01 589. 201881 9.7 4.9 2692
## 5 1968-04-01 544 200208 12.3 4.6 2709
## 6 1969-03-01 589. 202023 10.2 4 2712
## 7 1969-05-01 600. 202331 10.1 4.2 2713
## 8 1968-11-01 577. 201466 10.6 4.4 2715
## 9 1969-01-01 584. 201760 10.3 4.4 2718
## 10 1968-05-01 550. 200361 12 4.4 2740
## # … with 564 more rows
If we plot the reordered data with both geom_line() and geom_path() we get two very different graphs.
ggplot(economics2) +
geom_line(mapping = aes(x = date, y = unemploy))

ggplot(economics2) +
geom_path(mapping = aes(x = date, y = unemploy))
The first plot uses geom_line(), hence the points are connected in order along the x axis. The second plot uses geom_path(). Its points are connected in the order that they appear in the dataset, which happens to put them in order along the y axis.
Why would you want to use geom_path()? The code below illustrates one particularly useful case. The tx dataset contains latitude and longitude coordinates saved in a specific order.
library(maps)
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
tx <- map_data("state", region = "texas")
tx
## # A tibble: 1,088 × 6
## long lat group order region subregion
## <dbl> <dbl> <dbl> <int> <chr> <chr>
## 1 -94.5 33.7 1 1 texas <NA>
## 2 -94.5 33.7 1 2 texas <NA>
## 3 -94.5 33.6 1 3 texas <NA>
## 4 -94.5 33.6 1 4 texas <NA>
## 5 -94.5 33.6 1 5 texas <NA>
## 6 -94.4 33.6 1 6 texas <NA>
## 7 -94.4 33.6 1 7 texas <NA>
## 8 -94.4 33.6 1 8 texas <NA>
## 9 -94.4 33.6 1 9 texas <NA>
## 10 -94.3 33.6 1 10 texas <NA>
## # … with 1,078 more rows
ggplot(tx) +
geom_path(mapping = aes(x = long, y = lat))

image
"Good job! `geom_path()` reveals how you can use what is essentially a line plot to make a map (this is a map of the state of Texas). There are other ways to make maps in R, but this low tech method is surprisingly versatile."
geom_polygon() extends geom_path() one step further: it connects the last point to the first and then colors the interior region with a fill. The result is a polygon.
ggplot(tx) +
geom_polygon(mapping = aes(x = long, y = lat))

image
What do you think went wrong in the plot of Texas below?
Correct!
It looks like someone messed with tx. tx and datasets like it will have an order variable that you can use to ensure that the data is in the correct order before you plot it.
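A sketch of that repair, assuming the maps and dplyr packages are available: arrange() the rows by the order column before plotting.

```r
library(ggplot2)
library(dplyr)
library(maps)  # map_data() fetches its datasets from here

tx <- map_data("state", region = "texas")

# Restore the drawing sequence that geom_path() expects by sorting
# on the order column, then plot the outline.
tx_fixed <- tx %>% arrange(order)
p_tx <- ggplot(tx_fixed) +
  geom_path(mapping = aes(x = long, y = lat))
```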
The tx data set comes from the maps package, which is an R package that contains similarly formatted data sets for many regions of the globe.
A short list of the datasets saved in maps includes: france, italy, nz, usa, world, and world2, along with county and state. These last two map the US at the county and state levels. To learn more about maps, run help(package = maps).
You do not need to access the maps package to use its data. ggplot2 provides the function map_data() which fetches maps from the maps package and returns them in a format that ggplot2 can plot.
To use map_data() give it the name of a dataset to retrieve. You can retrieve a subset of the data by providing an optional region argument. For example, I can use this code to retrieve a map of Florida from state, which is the dataset that contains all 50 US states.
fl <- map_data("state", region = "florida")
ggplot(fl) +
geom_polygon(mapping = aes(x = long, y = lat))

Alter the code to retrieve and plot your home state (Try Idaho if you are outside of the US). Notice the capitalization.
library(maps)
id <- map_data("state", region = "idaho")
ggplot(id) +
geom_polygon(mapping = aes(x = long, y = lat))

If you do not specify a region, map_data() will retrieve the entire data set, in this case state.
us <- map_data("state")
In practice, you will often have to retrieve an entire dataset at least once to learn what region names to use with map_data(). The names will be stored in the region column of the dataset.
The code below retrieves and plots the entire state data set, but something goes wrong. What?
us <- map_data("state")
ggplot(us) +
geom_polygon(mapping = aes(x = long, y = lat))

In this case, our data is not out of order, but it contains more than one polygon: it contains many polygons, at least one for each state.
By default, geom_polygon() tries to plot a single polygon, which causes it to connect multiple polygons in weird ways.

Which aesthetic can you use to plot multiple polygons? In the code below, map the aesthetic to the group variable in the state dataset. This variable contains all of the grouping information needed to make a coherent map. Then rerun the code.
ggplot(us) +
geom_polygon(mapping = aes(x = long, y = lat))

ggplot(us) +
geom_polygon(mapping = aes(x = long, y = lat, group = group))

R comes with a data set named USArrests that we can use in conjunction with our plot above to make a choropleth map. A choropleth map uses the color of each region in the plot to display some value associated with the region.
In our case we will use the UrbanPop variable of USArrests, which records how urbanized each state was in 1973. UrbanPop is the percent of the population who lived within a city.
USArrests
## # A tibble: 50 × 4
## Murder Assault UrbanPop Rape
## <dbl> <int> <int> <dbl>
## 1 13.2 236 58 21.2
## 2 10 263 48 44.5
## 3 8.1 294 80 31
## 4 8.8 190 50 19.5
## 5 9 276 91 40.6
## 6 7.9 204 78 38.7
## 7 3.3 110 77 11.1
## 8 5.9 238 72 15.8
## 9 15.4 335 80 31.9
## 10 17.4 211 60 25.8
## # … with 40 more rows
You can use geom_map() to create choropleth maps. geom_map() pairs a data frame like USArrests with a map dataset like us by matching region names.
To use geom_map(), we first need to ensure that a common set of region names appears across both datasets.
At the moment, this isn’t the case. USArrests uses capitalized state names and hides them outside of the dataset in the row names (instead of in a column). In contrast, us uses a column of lower case state names. The code below fixes this.
USArrests2 <- USArrests %>%
rownames_to_column("region") %>%
mutate(region = tolower(region))
USArrests2
## # A tibble: 50 × 5
## region Murder Assault UrbanPop Rape
## <chr> <dbl> <int> <int> <dbl>
## 1 alabama 13.2 236 58 21.2
## 2 alaska 10 263 48 44.5
## 3 arizona 8.1 294 80 31
## 4 arkansas 8.8 190 50 19.5
## 5 california 9 276 91 40.6
## 6 colorado 7.9 204 78 38.7
## 7 connecticut 3.3 110 77 11.1
## 8 delaware 5.9 238 72 15.8
## 9 florida 15.4 335 80 31.9
## 10 georgia 17.4 211 60 25.8
## # … with 40 more rows
To use geom_map():
1. Initialize a plot with the data set that contains your data. Here that is USArrests2.
2. Add geom_map(). Set the map_id aesthetic to the variable that contains the region names. Then set the fill aesthetic to the fill variable. You do not need to supply x and y aesthetics; geom_map() will derive these values from the map data set, which you must set with the map parameter. Since map is a parameter, it should go outside the aes() function.
3. Follow geom_map() with expand_limits(), and tell expand_limits() what the x and y variables in the map dataset are. This shouldn’t be necessary in future iterations of geom_map(), but for now ggplot2 will use the x and y arguments of expand_limits() to build the bounding box for your plot.
ggplot(USArrests2) +
geom_map(aes(map_id = region, fill = UrbanPop), map = us) +
expand_limits(x = us$long, y = us$lat)
"Congratulations! You've used geom_map() to make your first choropleth plot! To test your understanding, alter the code to display the Murder, Assault, or Rape variables."
You may have noticed that our maps look a little off. So far, we’ve plotted them in Cartesian coordinates, which distort the spherical surface described by latitude and longitude. Also, ggplot2 adjusts the aspect ratio of our plots to fit our graphing window, which can further distort our maps.
You can avoid both of these distortions by adding coord_map() to your plot. coord_map() displays the plot in a fixed cartographic projection. Note that coord_map() relies on the mapproj package, so you’ll need to have mapproj installed before you use coord_map().
library(mapproj)
ggplot(USArrests2) +
geom_map(aes(map_id = region, fill = UrbanPop), map = us) +
expand_limits(x = us$long, y = us$lat) +
coord_map()

By default, coord_map() replaces the coordinate system with a Mercator projection. To use a different projection, set the projection argument of coord_map() to a projection name, surrounded by quotation marks.
To see this, extend the code below to view the map in a “sinusoidal” projection.
ggplot(USArrests2) +
geom_map(aes(map_id = region, fill = UrbanPop), map = us) +
expand_limits(x = us$long, y = us$lat)

ggplot(USArrests2) +
geom_map(aes(map_id = region, fill = UrbanPop), map = us) +
expand_limits(x = us$long, y = us$lat) +
coord_map(projection = "sinusoidal")

You can now make all of the plots recommended in the Exploratory Data Analysis tutorial. The next tutorial in this primer will teach you several strategies for dealing with overplotting, a problem that can occur when you have large data or low resolution data.
Here you will learn to handle a problem that occurs when graphing data—especially large data. Along the way, you will meet several new geoms.
Data Visualization is a useful tool because it makes data accessible to your visual system, which can process large amounts of information quickly. However, two characteristics of data can short circuit this system. Data cannot be easily visualized if its values have been rounded to a low resolution, or if it contains a very large number of observations.
These features both create overplotting, the condition where multiple geoms in the plot are plotted on top of each other, hiding each other. This tutorial will show you several strategies for dealing with overplotting, introducing new geoms along the way.
The tutorial is adapted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. You can purchase the book at shop.oreilly.com.
The tutorial uses the ggplot2 and hexbin packages, which have been pre-loaded for your convenience.
You’ve seen this plot several times in previous tutorials, but have you noticed that it only displays 126 points? This is unusual because the plot visualizes a data set that contains 234 points.
The missing points are hidden behind other points, a phenomenon known as overplotting. Overplotting is a problem because it provides an incomplete picture of the dataset. You cannot determine where the mass of the points falls, which makes it difficult to spot relationships in the data.
Overplotting usually occurs for two different reasons: either the data values have been rounded to a low resolution, so that many points fall in exactly the same place, or the dataset is so large that its points overlap.
How you deal with overplotting will depend on the cause.
If your overplotting is due to rounding, you can obtain a better picture of the data by making each point semi-transparent. For example, you could set the alpha aesthetic of the plot below to a value less than one, which will make the points transparent.
Try this now. Set the points to an alpha of 0.25, which will make each point 25% opaque (i.e., four points stacked on top of each other will create a solid black).
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), alpha = 0.25)

"Good job! You can now identify which values contain more observations. The darker locations contain several points stacked on top of each other."
A second strategy for dealing with rounding is to adjust the position of each point. position = “jitter” adds a small amount of random noise to the location of each point. Since the noise is random, it is unlikely that two points rounded to the same location will also be jittered to the same location.
The result is a jittered plot that displays more of the data. Jittering comes with both limitations and benefits. You cannot use a jittered plot to see the local values of the points, but you can use a jittered plot to perceive the global relationship between the variables, something that is hard to do in the presence of overplotting.
#### Review - jitter
In the Scatterplots tutorial, you learned of a geom that displays the equivalent of geom_point() with a position = “jitter” adjustment.
Rewrite the code below to use that geom. Do you obtain similar results?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

ggplot(data = mpg) +
geom_jitter(mapping = aes(x = displ, y = hwy))

"Good job! The jittered plot displays the same overall pattern, but reveals points that were previously hidden on top of one another."
A dataset does not need to be truly “Big Data” to be hard to visualize. The diamonds data set contains fewer than 54,000 points, but it still suffers from overplotting when you try to plot carat vs. price. Here the bulk of the points fall on top of each other in an impenetrable cloud of blackness.
image
Alpha and jittering are less useful for large data. Jittering will not separate the points, and a mass of transparent points can still look black.
A better way to deal with overplotting due to large data is to visualize a summary of the data. In fact, we’ve already worked with this dataset by using geoms that naturally summarise the data, like geom_histogram() and geom_smooth().
Let’s look at several other geoms that you can use to summarise relationships in large data.
Boxplots efficiently summarise data, which make them a useful tool for large data sets. In the boxplots tutorial, you learned how to use cut_width() and the group aesthetic to plot multiple boxplots for a continuous variable.
Modify the code below to cut the carat axis into intervals with width 0.2. Then set the group aesthetic of geom_boxplot() to the result.
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = carat, y = price))
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?

ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = carat, y = price, group = cut_width(carat, width = 0.2)))

"Good job! The medians of the boxplots give a somewhat more precise description of the relationship between carat and price than does the fan of individual points."
geom_bin2d() provides a new way to summarise two dimensional continuous relationships. You can think of bin2d as working like a three dimensional histogram. It divides the Cartesian field into small rectangular bins, like a checkerboard. It then counts how many points fall into each bin, and maps the count to color. Bins that contain no points are left blank.
image
By studying the results, we can see that the mass of points falls in the bottom left of the graph.
Like histograms, bin2d plots use bins and binwidth arguments. Each should be set to a vector of two numbers: one for the number of bins (or binwidths) to use on the x axis, and one for the number of bins (or binwidths) to use on the y axis.
Use one of these parameters to modify the graph below to use 40 bins on the x axis and 50 on the y axis.
ggplot(data = diamonds) +
geom_bin2d(mapping = aes(x = carat, y = price))

ggplot(data = diamonds) +
geom_bin2d(mapping = aes(x = carat, y = price), bins = c(40, 50))

"Good job! As with histograms, bin2ds can reveal different information at different binwidths."
Our eyes are drawn to straight vertical and horizontal lines, which makes it easy to perceive “edges” in a bin2d that are not necessarily there (the rectangular bins naturally form edges that span the breadth of the graph).
One way to avoid this, if you like, is to use geom_hex(). geom_hex() functions like geom_bin2d() but uses hexagonal bins. Adjust the graph below to use geom_hex().
ggplot(data = diamonds) +
geom_bin2d(mapping = aes(x = carat, y = price))
ggplot(data = diamonds) +
geom_hex(mapping = aes(x = carat, y = price))
geom_density2d() provides one last way to summarize a two dimensional continuous relationship. Think of density2d as the two dimensional analog of density. Instead of drawing a line that rises and falls on the y dimension, it draws a field over the coordinate axes that rises and falls on the z dimension, that’s the dimension that points straight out of the graph towards you.
The result is similar to a mountain that you are looking straight down upon. The high places on the mountain show where the most points fall and the low places show where the fewest points fall. To visualize this mountain, density2d draws contour lines that connect areas with the same “height”, just like a contour map draws elevation.
Here we see the “ridge” of points that occur at low values of carat and price.
By default, density2d zooms in on the region that contains density lines. This may not be the same region spanned by the data points. If you like, you can re-expand the graph to the region spanned by the price and carat variables with expand_limits().
expand_limits() zooms the x and y axes to fit the range of any two variables (they need not be the original x and y variables).
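As a minimal sketch of how this might look, assuming the diamonds density2d plot from this section:

```r
# Re-expand a density2d plot to the full range of the carat and price
# variables, rather than the smaller region the density lines occupy.
library(ggplot2)

ggplot(data = diamonds) +
  geom_density2d(mapping = aes(x = carat, y = price)) +
  expand_limits(x = range(diamonds$carat),
                y = range(diamonds$price))
```

Here range() supplies the minimum and maximum of each variable, which expand_limits() then folds into the axis limits.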
#### Exercise - density2d
Often density2d plots are easiest to read when you plot them on top of the original data. In the chunk below create a plot of diamond carat size vs. price. The plot should contain density2d lines superimposed on top of the raw points. Make the raw points transparent with an alpha of 0.1.
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
geom_point(alpha = 0.1) +
geom_density_2d()
"Good job! Plotting a summary on top of raw values is a common pattern in data science."
Overplotting is a common phenomenon in plots because its causes are a common phenomenon in data sets.
When overplotting results from rounding errors, you can work around it by manipulating the transparency or location of the points.
For larger datasets you can use geoms that summarise the data to display relationships without overplotting. This is an effective tactic for truly big data as well, and it also works for the first case of overplotting due to rounding.
One final tactic is to sample your data, creating a subset that is small enough to visualize without overplotting.
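As a sketch of this last tactic (the sample size of 5,000 is an arbitrary choice, not a recommendation):

```r
# Draw a random sample of the ~54,000 diamonds so that individual points
# remain distinguishable, then plot the sample instead of the full data.
library(ggplot2)
library(dplyr)

diamonds_small <- slice_sample(diamonds, n = 5000)

ggplot(data = diamonds_small) +
  geom_point(mapping = aes(x = carat, y = price))
```

Because the sample is random, the overall shape of the relationship is preserved even though most points are dropped.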
You’ve now learned a complete toolkit for exploring data visually. The final tutorial in this primer will show you how to polish the plots you make for publication. Instead of learning how to visualize data, you will learn how to add titles and captions, customize color schemes and more.
Learn to adjust color schemes, titles, legends, and more to make your plots perfect for publication.
This tutorial will teach you how to customize the look and feel of your plots. You will learn how to:
The tutorial is adapted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. You can purchase the book at shop.oreilly.com.
The tutorial uses the ggplot2, dplyr, scales, ggthemes, and viridis packages, which have been pre-loaded for your convenience.
In the previous tutorials, you learned how to visualize data with graphs. Now let’s look at how to customize the look and feel of your graphs. To do that we will need to begin with a graph that we can customize.
In the chunk below, make a plot that uses boxplots to display the relationship between the cut and price variables from the diamonds dataset.
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = cut, y = price))
Since we want to use this plot again later, let’s go ahead and save it.
p <- ggplot(diamonds) +
geom_boxplot(mapping = aes(x = cut, y = price))
Now whenever you call p, R will draw your plot. Try it and see.
p
"Good job! By the way, have you taken a moment to look at what the plot shows? Let's do that now."
Our plot shows something surprising: when you group diamonds by cut, the worst cut diamonds have the highest median price. It’s a little hard to see in the plot, but you can verify it with some data manipulation.
diamonds %>%
group_by(cut) %>%
summarise(median = median(price))
## # A tibble: 5 × 2
## cut median
## <ord> <dbl>
## 1 Fair 3282
## 2 Good 3050.
## 3 Very Good 2648
## 4 Premium 3185
## 5 Ideal 1810
The difference between median prices is hard to see in our plot because each group contains distant outliers.
We can make the difference easier to see by zooming in on the low values of y, where the medians are located. There are two ways to zoom with ggplot2: with and without clipping.
Clipping refers to how R should treat the data that falls outside of the zoomed region. To see its effect, look at these plots. Each zooms in on the region where price is between $0 and $7,500.
* The plot on the left zooms by clipping. It removes all of the data points that fall outside of the desired region, and then plots the data points that remain.
* The plot on the right zooms without clipping. You can think of it as drawing the entire graph and then zooming into a certain region.
Of these, zooming by clipping is the easiest to do. To zoom your graph on the x axis, add the function xlim() to the plot call. To zoom on the y axis add the function ylim(). Each takes a minimum value and a maximum value to zoom to, like this
some_plot +
xlim(0, 100)
Use ylim() to recreate our plot on the left from above. The plot zooms the y axis from 0 to 7,500 by clipping.
p
p + ylim(0, 7500)
## Warning: Removed 8382 rows containing non-finite values (`stat_boxplot()`).
Zooming by clipping is a bad idea for boxplots. ylim() fundamentally changes the information conveyed in the boxplots because it throws out some of the data before drawing the boxplots. Those aren’t the medians of the entire data set that we are looking at.
How then can we zoom without clipping?
To zoom without clipping, set the xlim and/or ylim arguments of your plot’s coord_ function. Each takes a numeric vector of length two (the minimum and maximum values to zoom to).
This is easy to do if your plot explicitly calls a coord_ function
p + coord_flip(ylim = c(0, 7500))
But what if your plot doesn’t call a coord_ function? Then your plot is using Cartesian coordinates (the default). You can adjust the limits of your plot without changing the default coordinate system by adding coord_cartesian() to your plot.
Try it below. Use coord_cartesian() to zoom p to the region where price falls between 0 and 7500.
p + coord_cartesian(ylim = c(0, 7500))
"Good job! Now it is much easier to see the differences in the median."
Notice that our code so far has used p to make a plot, but it hasn’t changed the plot that is saved inside of p. You can run p by itself to get the unzoomed plot.
p
I like the zooming, so I’m purposefully going to overwrite the plot stored in p so that it uses the zoomed version.
p <- p + coord_cartesian(ylim = c(0, 7500))
p
The relationship in our plot is now easier to see, but that doesn’t mean that everyone who sees our plot will spot it. We can draw their attention to the relationship with a label, like a title or a caption.
To do this, we will use the labs() function. You can think of labs() as an all purpose function for adding labels to a ggplot2 plot.
Give labs() a title argument to add a title.
p + labs(title = "The title appears here")
Give labs() a subtitle argument to add a subtitle. If you use multiple arguments, remember to separate them with a comma.
p + labs(title = "The title appears here",
subtitle = "The subtitle appears here, slightly smaller")
Give labs() a caption argument to add a caption. I like to use captions to cite my data source.
p + labs(title = "The title appears here",
subtitle = "The subtitle appears here, slightly smaller",
caption = "Captions appear at the bottom.")
Plot p with a set of informative labels. For learning purposes, be sure to use a title, subtitle, and caption.
p + labs(title = "Diamond prices by cut",
subtitle = "Fair cut diamonds fetch the highest median price. Why?",
caption = "Data collected by Hadley Wickham")
"Good job! By the way, why *do* fair cut diamonds fetch the highest price?"
Perhaps a diamond’s cut is conflated with its carat size. If fair cut diamonds tend to be larger diamonds that would explain their larger prices. Let’s test this.
Make a plot that displays the relationship between carat size, price, and cut for all diamonds. How do you interpret the results? Give your plot a title, subtitle, and caption that explain the plot and convey your conclusions.
If you are looking for a way to start, I recommend using a smooth line with color mapped to cut, perhaps overlaid on the background data.
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
geom_smooth(mapping = aes(color = cut), se = FALSE) +
labs(title = "Carat size vs. Price",
subtitle = "Fair cut diamonds tend to be large, but they fetch the lowest prices for most carat sizes.",
caption = "Data by Hadley Wickham")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Unlike p, our new plot uses color and has a legend. Let’s save it to use later when we learn to customize colors and legends.
p1 <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
geom_smooth(mapping = aes(color = cut), se = FALSE) +
labs(title = "Carat size vs. Price",
subtitle = "Fair cut diamonds tend to be large, but they fetch the lowest prices for most carat sizes.",
caption = "Data by Hadley Wickham")
annotate() provides a final way to label your graph: it adds a single geom to your plot. When you use annotate(), you must first choose which type of geom to add. Next, you must manually supply a value for each aesthetic required by the geom.
So for example, we could use annotate() to add text to our plot.
p1 + annotate("text", x = 4, y = 7500, label = "There are no cheap,\nlarge diamonds")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Notice that I select geom_text() with “text”, the suffix of the function name in quotation marks.
In practice, I find annotate() time consuming to work with, but you can accomplish quite a lot with annotate() if you take the time.
One of the most effective ways to control the look of your plot is with a theme.
A theme describes how the non-data elements of your plot should look. For example, these two plots show the same data, but they use two very different themes.

To change the theme of your plot, add a theme_ function to your plot call. The ggplot2 package provides eight theme functions to choose from.
Use the box below to plot p1 with each of the themes. Which theme do you prefer? Which theme does ggplot2 apply by default?
p1 + theme_bw()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_classic()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_dark()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_gray()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_light()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_linedraw()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_minimal()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_void()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
"Good Job! ggplot2 uses theme_gray() by default."
If you would like to give your graph a more complete makeover, the ggthemes package provides extra themes that imitate the graph styles of popular software packages and publications. These include:
Try plotting p1 with at least two or three of the themes mentioned above.
library(ggthemes)
p1 + theme_base()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_calc()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_economist()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_economist_white()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_excel()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_few()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_fivethirtyeight()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_foundation()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_gdocs()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_hc()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_igray()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_map()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_pander()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_par()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_solarized()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_solarized_2()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_solid()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_stata()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_tufte()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
p1 + theme_wsj()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
"Good Job! Notice that each theme supplies its own font sizes, which means that your captions might run off the page for some themes. In practice, you can fix this by resizing your graph window."
If you compare the ggthemes themes to the styles they imitate, you might notice something: the colors used to plot your data haven’t changed. The colors are noticeably ggplot2 colors. In the next section, we’ll look at how to customize this remaining part of your graph: the data elements.
Before we go on, I suggest that we update p1 to use theme_bw(). It will make our next set of modifications easier to see.
p1 <- p1 + theme_bw()
p1
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Every time you map an aesthetic to a variable, ggplot2 relies on a scale to select the specific colors, sizes, or shapes to use for the values of your variable.
A scale is an R function that works like a mathematical function; it maps each value in a data space to a level in an aesthetic space. But it may be easier to think of a scale as a “palette.” When you give your graph a color scale, you give it a palette of colors to use.
ggplot2 chooses a pleasing set of scales to use whenever you make a graph. You can change or customize these scales by adding a scale function to your plot call.
For example, the code below plots p1 in greyscale instead of the default colors.
p1 + scale_color_grey()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#### A second example
You can add scales for every aesthetic mapping, including the x and y mappings (the code below log transforms the x and y axes).
p1 +
scale_x_log10() +
scale_y_log10()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot2 supplies over 50 scales to use. This may seem overwhelming, but the scales are organized according to an intuitive naming convention.
ggplot2 scale functions follow a naming convention. Each function name contains the same three elements in order, separated by underscores:
scale_shape_manual() and scale_x_continuous() are examples of the naming scheme.
You can see the complete list of scale names at http://ggplot2.tidyverse.org/reference/. In this tutorial, we will focus on scales that work with the color aesthetic.
Scales specialize in either discrete variables or continuous variables. In other words, you would use a different set of scales to map a discrete variable, like diamond clarity, than you would use to map a continuous variable, like diamond price.
Which type of variable does p1 map to the color aesthetic?
Correct!
p1 maps color to cut, a discrete variable with five distinct levels.
One of the most useful color palettes for discrete variables is scale_color_brewer() (or scale_fill_brewer() if you are working with fill). Run the code below to see the effect of the scale.
p1 + scale_color_brewer()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
"Good job! scale_color_brewer() applies a color palette from the RColorBrewer package, a package that specializes in attractive color palettes."
The RColorBrewer package contains a variety of palettes developed by Cynthia Brewer. Each palette is designed to look pleasing as well as to differentiate between the values represented by the palette. You can learn more about the color brewer project at colorbrewer2.org.
Altogether, the RColorBrewer package contains 35 palettes. You can see each palette and its name by running RColorBrewer::display.brewer.all(). Try it below.
library(RColorBrewer)
RColorBrewer::display.brewer.all()
"Good job! Our graph above used the Blues palette (the default)."
By default, scale_color_brewer() will use the “Blues” palette from the RColorBrewer package. To use a different RColorBrewer palette, set the palette argument of scale_color_brewer() to one of the RColorBrewer palette names, surrounded by quotation marks, e.g.
p1 + scale_color_brewer(palette = "Purples")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#### Exercise - scale_color_brewer()
Recreate the graph below, which uses a different palette from the RColorBrewer package.
image
p1 + scale_color_brewer(palette = "Spectral")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
"Good job! scale_color_brewer() is one of the most useful functions for customizing colors in ggplot2 because it does for you the hard work of selecting a pleasing combination of colors. If you'd like to select individual colors yourself, try the scale_color_manual() function."
scale_color_brewer() works with discrete variables, but what if your plot maps color to a continuous variable?
Since we do not have a plot that applies color to a continuous variable, let’s make one.
p_cont <- ggplot(data = mpg) +
geom_jitter(mapping = aes(x = displ, y = hwy, color = hwy)) +
theme_bw()
p_cont
If we apply scale_color_brewer() to our new plot, we get an error message that confirms what you know: you cannot use a scale that is built for discrete variables to customize the mapping to a continuous variable.
p_cont + scale_color_brewer()
## Error: Continuous value supplied to discrete scale
Luckily, scale_color_brewer() comes with a continuous analogue named scale_color_distiller() (also scale_fill_distiller()).
Use scale_color_distiller() just as you would scale_color_brewer(). scale_color_distiller() will take any RColorBrewer palette, and interpolate between colors as necessary to provide an entire continuous range of colors.
So for example, we could reuse the Spectral palette in our continuous plot
p_cont + scale_color_distiller(palette = "Spectral")
#### Exercise - scale_color_distiller()
Recreate the graph below, which uses a different palette from the RColorBrewer package.
image
p_cont + scale_color_distiller(palette = "BrBG")
"Good job! ggplot2 also supplies scale_color_gradient(), scale_color_gradient2(), and scale_color_gradientn(), which you can use to construct gradients manually between 2, 3, and n colors."
The viridis package contains a collection of very good looking color palettes for continuous variables. Each palette is designed to show the gradation of continuous values in an attractive, and perceptually uniform way (no range of values appears more important than another). As a bonus, the palettes are both colorblind-friendly and black-and-white printer friendly!
To add a viridis palette, use scale_color_viridis() or scale_fill_viridis(), both of which come in the viridis package.
library(viridis)
## Loading required package: viridisLite
##
## Attaching package: 'viridis'
## The following object is masked from 'package:maps':
##
## unemp
p_cont + scale_color_viridis()
Altogether, the viridis package comes with four color palettes, named magma, plasma, inferno, and viridis.
However, you do not select the palettes by name. To select a viridis color palette, set the option argument of scale_color_viridis() to one of “A” (magma), “B” (inferno), “C” (plasma), or “D” (viridis).
Try each option with p_cont below. Determine which is the default.
p_cont + scale_color_viridis(option = "A")
p_cont + scale_color_viridis(option = "B")
p_cont + scale_color_viridis(option = "C")
p_cont + scale_color_viridis(option = "D") # D is the default. See ?scale_color_viridis
"Good job! Option D is the default if you do not select an option."
The last piece of a ggplot2 graph to customize is the legend. When it comes to legends, you can customize the:
Customizing legends is a little more chaotic than customizing other parts of the graph, because the information that appears in a legend comes from several different places.
To change the position of a legend in a ggplot2 graph add one of the below to your plot call:
+ theme(legend.position = "bottom")
+ theme(legend.position = "top")
+ theme(legend.position = "left")
+ theme(legend.position = "right") (the default)

Try this now. Move the legend in p_cont to the bottom of the graph.
p_cont + theme(legend.position = "bottom")
"Good job! If you move the legend to the top or bottom of the plot, ggplot2 will reorganize the orientation of the legend from vertical to horizontal."
Theme functions like theme_grey() and theme_bw() also adjust the legend position (among all of the other details they orchestrate). So if you use theme(legend.position = "bottom") in your plots, be sure to add it after any theme_ functions you call, like this:
ggplot(data = mpg) +
geom_jitter(mapping = aes(x = displ, y = hwy, color = hwy)) +
theme_bw() +
theme(legend.position = "bottom")
If you do this, ggplot2 will apply all of the settings of theme_bw(),
and then overwrite the legend position setting to “bottom” (instead of
vice versa).
You may have noticed that color and fill legends take two forms. If you map color (or fill) to a discrete variable, the legend will look like a standard legend. This is the case for the bottom legend below.
If you map color (or fill) to a continuous variable, your legend will look like a colorbar. This is the case in the top legend below. The color bar helps convey the continuous nature of the variable.
You can use the guides() function to change the type or presence of each legend in the plot. To use guides(), type the name of the aesthetic whose legend you want to alter as an argument name. Then set it to one of "legend", "colorbar", or "none", e.g.
p_legend <- ggplot(data = mpg) +
geom_jitter(mapping = aes(x = displ, y = hwy, color = class, fill = hwy),
shape = 21, size = 3, stroke = 1) +
theme_bw()
p_legend + guides(fill = "legend", color = "none")
p_legend + guides(fill = "none", color = "none")
p_cont
p_cont + guides(fill = "legend", color = "none")
p_cont + guides(fill = "none", color = "none")
"Good job! guides() gives you fine-grained control over which legends appear in your plot and what form they take."
To control the title and labels of a legend, you must turn to the scale_ functions. Each scale_ function takes a name and a labels argument, which it will use to build the legend associated with the scale. The labels argument should be a vector of strings that has one string for each label in the default legend.
So for example, you can adjust the legend of p1 with
p1 + scale_color_brewer(name = "Cut Grade", labels = c("Very Bad", "Bad", "Mediocre", "Nice", "Very Nice"))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
This is handy, but it raises a question: what if you haven’t invoked a scale_ function to pass labels to? For example, the graph below relies on the default scales.
p1
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
In this case, you need to identify the default scale used by the plot and then manually add that scale to the plot, setting the labels as you do.
For example, our plot above relies on the default color scale for a discrete variable, which happens to be scale_color_discrete(). If you know this, you can relabel the legend like so:
p1 + scale_color_discrete(name = "Cut Grade", labels = c("Very Bad", "Bad", "Mediocre", "Nice", "Very Nice"))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
As you can see, it is handy to know which scales a ggplot2 graph will use by default. Here’s a short list.
| aesthetic | variable | default |
|---|---|---|
| x | continuous | scale_x_continuous() |
| discrete | scale_x_discrete() | |
| y | continuous | scale_y_continuous() |
| discrete | scale_y_discrete() | |
| color | continuous | scale_color_continuous() |
| discrete | scale_color_discrete() | |
| fill | continuous | scale_fill_continuous() |
| discrete | scale_fill_discrete() | |
| size | continuous | scale_size() |
| shape | discrete | scale_shape() |
In this tutorial, you learned how to customize the graphs that you make with ggplot2 in several ways. You learned how to:
To cement your skills, combine what you’ve learned to recreate the plot below.
image
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
geom_point() +
geom_smooth(mapping = aes(color = cut), se = FALSE) +
labs(title = "Ideal cut diamonds command the best price for every carat size",
subtitle = "Lines show GAM estimate of mean values for each level of cut",
caption = "Data provided by Hadley Wickham",
x = "Log Carat Size",
y = "Log Price",
color = "Cut Rating") +
scale_x_log10() +
scale_y_log10() +
scale_color_brewer(palette = "Greens") +
theme_light()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Unlock the tidyverse by learning how to make and use tidy data, the data format designed for R.
Data comes in many formats, but R prefers just one: Tidy Data. Learn to recognise and make tidy data in this tutorial, as well as how to reshape the layout of any data set.
The tools that you learned in the previous Primers work best when your data is organized in a specific way. This format is known as tidy data and it appears throughout the tidyverse. You will spend a lot of time as a data scientist wrangling your data into a useable format, so it is important to learn how to do this fast.
This tutorial will teach you how to recognize tidy data, as well as how to reshape untidy data into a tidy format. In it, you will learn the core data wrangling functions for the tidyverse:
* pivot_longer()
* pivot_wider()

This tutorial uses the core tidyverse packages, including ggplot2, dplyr, and tidyr, as well as the babynames package. All of these packages have been pre-installed and pre-loaded for your convenience.
Click the Next Topic button to begin.
In Exploratory Data Analysis, we proposed three definitions that are useful for data science:
A variable is a quantity, quality, or property that you can measure.
A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
An observation is a set of measurements that are made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a case or data point.
These definitions are tied to the concept of tidy data. To see how, let’s apply the definitions to some real data.
table1
## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
What are the variables in the data set above? Check all that apply.
Good Job! The data set contains four variables measured on six observations: country, year, cases, and population.
Now consider this data set. Does it contain the same variables?
table2
## # A tibble: 12 × 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
Does the data above contain the variables country, year, cases, and population?
Correct!
If you look closely, you will see that this is the same data set as before, but organized in a new way.
These data sets reveal something important: you can reorganize the same set of variables, values, and observations in many different ways.
It’s not hard to do. If you run the code chunks below, you can see the same data displayed in three more ways.
table3
## # A tibble: 6 × 3
## country year rate
## <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
table4a; table4b
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
table5
## # A tibble: 6 × 4
## country century year rate
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
Data can come in a variety of formats, but one format is easier to use in R than the others. This format is known as tidy data. A data set is tidy if:
Among our tables above, only table1 is tidy.
table1
## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
To see why tidy data is easier to use, consider a basic task. Each code chunk below extracts the values of the cases variable as a vector and computes the mean of the variable. One uses a tidy table, table1:
mean(table1$cases)
## [1] 91276.67
The other uses an untidy table, table2:
mean(table2$count[c(1, 3, 5, 7, 9, 11)])
## [1] 91276.67
Which line of code is easier to write? Which line could you write if you’ve only looked at the first row of the data?
Not only is the code for table1 easier to write, it is easier to reuse. To see what I mean, modify the code chunks below to compute the mean of the population variable for each table.
First with table1:
mean(table1$cases)
## [1] 91276.67
Then with table2:
mean(table2$count[c(1, 3, 5, 7, 9, 11)])
## [1] 91276.67
Again table1 is easier to work with; you only need to change the name of the variable that you wish to extract. Code like this is easier to generalize to new data sets (if they are tidy) and easier to automate with a function.
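To see what “easier to automate with a function” means, here is a sketch; col_mean() is a hypothetical helper written for illustration, not a tidyverse function:

```r
# With tidy data, summarising any variable follows the same one-line
# pattern, so the pattern is easy to wrap in a function.
library(tidyr)  # provides the table1 data set

# col_mean() is a hypothetical helper: extract a column by name, take its mean.
col_mean <- function(df, var) {
  mean(df[[var]])
}

col_mean(table1, "cases")       # equivalent to mean(table1$cases)
col_mean(table1, "population")  # reuse: only the variable name changes
```

The same call would not work on table2, where the values of cases and population are interleaved in a single count column.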
Let’s look at one more advantage.
Suppose you would like to compute the ratios of cases to population for each country and each year. To do this, you need to ensure that the correct value of cases is paired with the correct value of population when you do the calculation.
Again, this is hard to do with untidy table2:
table2$count[c(1, 3, 5, 7, 9, 11)] / table2$count[c(2, 4, 6, 8, 10, 12)]
## [1] 0.0000372741 0.0001294466 0.0002193930 0.0004612363 0.0001667495
## [6] 0.0001669488
But it is easy to do with tidy table1. Give it a try below:
table1$cases / table1$population
## [1] 0.0000372741 0.0001294466 0.0002193930 0.0004612363 0.0001667495
## [6] 0.0001669488
These small differences may seem petty, but they add up over the course of a data analysis, stealing time and inviting mistakes.
The tidy data format works so well for R because it aligns the structure of your data with the mechanics of R:
R stores each data frame as a list of column vectors, which makes it easy to extract a column from a data frame as a vector. Tidy data places each variable in its own column vector, which makes it easy to extract all of the values of a variable to compute a summary statistic, or to use the variable in a computation.
R computes many functions and operations in a vectorized fashion, matching the first values of each vector of input to compute the first result, matching the second values of each input to compute the second result, and so on. Tidy data ensures that R will always match values with other values from the same observation whenever vector inputs are drawn from the same table.
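A minimal sketch of that pairwise matching, using the 1999 values that appear in table1 above:

```r
# Vectorized division matches elements by position:
# first with first, second with second, and so on
cases      <- c(745, 37737, 212258)              # Afghanistan, Brazil, China (1999)
population <- c(19987071, 172006362, 1272915272)

cases / population  # one rate per country, rows kept aligned
```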
As a result, most functions in R, and every function in the tidyverse, will expect your data to be organized into a tidy format. (You may have noticed above that we could use dplyr functions to work on table1, but not on table2.)
“Data comes in many formats, but R prefers just one: tidy data.” — Garrett Grolemund
A data set is tidy if: 1) each variable is in its own column, 2) each observation is in its own row, and 3) each value is in its own cell.
Now that you know what tidy data is, what can you do about untidy data?
“Tidy data sets are all alike; but every messy data set is messy in its own way.” — Hadley Wickham
How you tidy an untidy data set will depend on the initial configuration of the data. For example, consider the cases data set below.
cases
## # A tibble: 3 × 4
## Country `2011` `2012` `2013`
## <chr> <dbl> <dbl> <dbl>
## 1 FR 7000 6900 7000
## 2 DE 5800 6000 6200
## 3 US 15000 14000 13000
What are the variables in cases?
Correct!
Video: https://vimeo.com/229581273
You can use the gather() function in the tidyr package to convert wide data to long data. Notice that gather() returns a tidy copy of the dataset, but does not alter the original dataset. If you wish to use this copy later, you’ll need to save it somewhere.
cases %>% gather(key = "year", value = "n", 2, 3, 4)
## # A tibble: 9 × 3
## Country year n
## <chr> <chr> <dbl>
## 1 FR 2011 7000
## 2 DE 2011 5800
## 3 US 2011 15000
## 4 FR 2012 6900
## 5 DE 2012 6000
## 6 US 2012 14000
## 7 FR 2013 7000
## 8 DE 2013 6200
## 9 US 2013 13000
# pivot_longer
cases %>% pivot_longer(cols = 2:4, names_to = "year", values_to = "n")
## # A tibble: 9 × 3
## Country year n
## <chr> <chr> <dbl>
## 1 FR 2011 7000
## 2 FR 2012 6900
## 3 FR 2013 7000
## 4 DE 2011 5800
## 5 DE 2012 6000
## 6 DE 2013 6200
## 7 US 2011 15000
## 8 US 2012 14000
## 9 US 2013 13000
Let’s take a closer look at the gather() syntax.
Here’s the same call written without the pipe operator, which makes the syntax easier to see.
gather(cases, key = "year", value = "n", 2, 3, 4)
To use gather(), pass it the name of a data set to reshape, followed by two new column names to use: a key name and a value name. Each name should be a character string surrounded by quotes. Finally, use numbers to tell gather() which columns to use to build the new columns. Here gather() will use the second, third, and fourth columns. gather() will remove these columns from the result, but their contents will appear in the new columns. Any unspecified columns will remain in the dataset, their contents repeated as often as necessary to preserve each relationship in the original untidy data set.
gather() relies on the idea of key-value pairs. A key-value pair lists a value alongside the name of the variable that the value describes. (We could store every value in a dataset as a key-value pair, but this is not how R works.)
In a tidy data set, you will find “keys”—that is variable names—in the column names of the data set. The values will appear in the cells of the columns. Here we know that the key for each value in the year column is year. This arrangement reduces duplication.
Sometimes you will also find key value pairs listed beside each other in two separate columns, as in table2. Here the type column lists the keys that are associated with the count column. This layout is sometimes called “narrow” data.
Tidyr functions rely on the key value vocabulary to describe what should go where. In gather() the key argument describes the new column that contains the values that previously appeared in the tidy key position, i.e. in the column names. The value argument describes the new column that contains the values that previously appeared in the value positions, e.g. in the column cells.
Now that you’ve seen gather() in action, try using it to tidy table4a:
table4a
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
cases %>% gather(key = "year", value = "n", 2, 3, 4)
table4a %>% gather(key = "year", value = "n", 2, 3)
## # A tibble: 6 × 3
## country year n
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
# pivot_longer
table4a %>% pivot_longer(cols = 2:3, names_to = "year", values_to = "n")
## # A tibble: 6 × 3
## country year n
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
Good job!
So far we’ve used numbers to describe which columns to reshape with gather(), but this isn’t necessary. gather() also recognizes column names as well as all of the select() helpers that you learned about in Isolating Data with dplyr. So for example, these expressions would all do the same thing:
table4a %>% gather(key = "year", value = "cases", 2, 3)
table4a %>% gather(key = "year", value = "cases", `1999`, `2000`)
table4a %>% gather(key = "year", value = "cases", -country)
table4a %>% gather(key = "year", value = "cases", one_of(c("1999", "2000")))
Notice that 1999 and 2000 are numbers. When you directly call column names that are numbers, you need to surround the names with backticks (otherwise gather() would think you mean the 1999th and 2000th columns). Use ?select_helpers to open a help page that lists the select helpers.
Use gather() and the - helper to tidy table4b into a dataset with three columns: country, year, and population.
table4b
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
table4b %>% gather(key = "year", value = "population", -country)
## # A tibble: 6 × 3
## country year population
## <chr> <chr> <int>
## 1 Afghanistan 1999 19987071
## 2 Brazil 1999 172006362
## 3 China 1999 1272915272
## 4 Afghanistan 2000 20595360
## 5 Brazil 2000 174504898
## 6 China 2000 1280428583
# pivot_longer
table4b %>% pivot_longer(cols = -country, names_to = "year", values_to = "population")
## # A tibble: 6 × 3
## country year population
## <chr> <chr> <int>
## 1 Afghanistan 1999 19987071
## 2 Afghanistan 2000 20595360
## 3 Brazil 1999 172006362
## 4 Brazil 2000 174504898
## 5 China 1999 1272915272
## 6 China 2000 1280428583
If you looked closely at your results in the previous exercises, you may have noticed something odd: the new year column contains character vectors. You can tell because R displays <chr> beneath the column name.
table4b %>% gather(key = "year", value = "population", -country, convert = TRUE)
## # A tibble: 6 × 3
## country year population
## <chr> <int> <int>
## 1 Afghanistan 1999 19987071
## 2 Brazil 1999 172006362
## 3 China 1999 1272915272
## 4 Afghanistan 2000 20595360
## 5 Brazil 2000 174504898
## 6 China 2000 1280428583
# pivot_longer
table4b %>% pivot_longer(cols = -country, names_to = "year", values_to = "population", names_transform = list(year = as.integer))
## # A tibble: 6 × 3
## country year population
## <chr> <int> <int>
## 1 Afghanistan 1999 19987071
## 2 Afghanistan 2000 20595360
## 3 Brazil 1999 172006362
## 4 Brazil 2000 174504898
## 5 China 1999 1272915272
## 6 China 2000 1280428583
You can ask R to convert each new column to an appropriate data type by adding convert = TRUE to the gather() call, as the first chunk above demonstrates. R will inspect the contents of the columns to choose the most likely data type.
cases, table4a, and table4b are all rectangular tables: each stores the values of one variable in a grid of cells, with the values of a second variable serving as the column names.
Rectangular tables are a simple form of wide data. But you will also encounter more complicated examples of wide data. For example, it is common for researchers to place one subject per row. In this case, you might see several columns of identifying information followed by a set of columns that list repeated measurements of the same variable. cases2 emulates such a data set.
cases2
## # A tibble: 3 × 6
## city country continent `2011` `2012` `2013`
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Paris FR Europe 7000 6900 7000
## 2 Berlin DE Europe 5800 6000 6200
## 3 Chicago US North America 15000 14000 13000
cases2 %>% gather(key = "year", value = "cases", 4:6)
## # A tibble: 9 × 5
## city country continent year cases
## <chr> <chr> <chr> <chr> <dbl>
## 1 Paris FR Europe 2011 7000
## 2 Berlin DE Europe 2011 5800
## 3 Chicago US North America 2011 15000
## 4 Paris FR Europe 2012 6900
## 5 Berlin DE Europe 2012 6000
## 6 Chicago US North America 2012 14000
## 7 Paris FR Europe 2013 7000
## 8 Berlin DE Europe 2013 6200
## 9 Chicago US North America 2013 13000
# pivot_longer
cases2 %>% pivot_longer(cols = 4:6, names_to = "year", values_to = "cases")
## # A tibble: 9 × 5
## city country continent year cases
## <chr> <chr> <chr> <chr> <dbl>
## 1 Paris FR Europe 2011 7000
## 2 Paris FR Europe 2012 6900
## 3 Paris FR Europe 2013 7000
## 4 Berlin DE Europe 2011 5800
## 5 Berlin DE Europe 2012 6000
## 6 Berlin DE Europe 2013 6200
## 7 Chicago US North America 2011 15000
## 8 Chicago US North America 2012 14000
## 9 Chicago US North America 2013 13000
The pollution dataset below displays the amount of small and large particulate in the air of three cities. It illustrates another common type of untidy data. Narrow data uses a literal key column and a literal value column to store multiple variables. Can you tell here which is which?
pollution
## # A tibble: 6 × 3
## city size amount
## <chr> <chr> <dbl>
## 1 New York large 23
## 2 New York small 14
## 3 London large 22
## 4 London small 16
## 5 Beijing large 121
## 6 Beijing small 121
Which column in pollution contains key names (i.e. variable names)?
Correct!
Two properties are being measured in this data: 1) the amount of small particulate in the air, and 2) the amount of large particulate.
Which column in pollution contains the values associated with the key names?
Correct!
What do these numbers represent? You can only tell when you match them with the variable names large (for large particulate) and small (for small particulate).
Video: https://vimeo.com/229581273
You can “spread” the keys in a key column across their own set of columns with the spread() function in the tidyr package. To use spread(), pass it the name of a data set to spread (provided here by the pipe %>%). Then tell spread() which column to use as a key column and which column to use as a value column.
pollution %>% spread(key = size, value = amount)
## # A tibble: 3 × 3
## city large small
## <chr> <dbl> <dbl>
## 1 Beijing 121 121
## 2 London 22 16
## 3 New York 23 14
# pivot_wider
pollution %>% pivot_wider(names_from = size, values_from = amount)
## # A tibble: 3 × 3
## city large small
## <chr> <dbl> <dbl>
## 1 New York 23 14
## 2 London 22 16
## 3 Beijing 121 121
spread() will give each unique value in the key column its own column; the value itself becomes the column name. spread() will then redistribute the values in the value column across the new columns in a way that preserves every relationship in the original dataset.
Use spread() to tidy table2 into a dataset with four columns: country, year, cases, and population. In short, convert table2 to look like table1.
table2
## # A tibble: 12 × 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table2 %>% spread(key = type, value = count)
## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
# pivot_wider
table2 %>% pivot_wider(names_from = type, values_from = count)
## # A tibble: 6 × 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
You may notice that both gather() and spread() take key and value arguments, and in each case the arguments are set to column names. But in the gather() case you must surround the names with quotes, while in the spread() case you do not. Why is this?
table4b %>% gather(key = "year", value = "population", -country)
pollution %>% spread(key = size, value = amount)
Don’t let the difference trip you up. Instead think about what the quotes mean.
In our gather() code above, “year” and “population” refer to two columns that do not yet exist. If R tried to look for objects named year and population it wouldn’t find them (at least not in the table4b dataset). When we use gather() we are passing R two values (character strings) to use as the name of future columns that will appear in the result.
In our spread() code, key and value point to two columns that do exist in the pollution dataset: size and amount. When we use spread(), we are telling R to find these objects (columns) in the dataset and to use their contents to create the result. Since they exist, we do not need to surround them in quotation marks.
In practice, whether or not you need to use quotation marks will depend on how the author of your function wrote the function (for example, spread() will still work if you do include quotation marks). However, you can use the intuition above as a guide for how to use functions in the tidyverse.
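For example, spread() accepts its key and value columns with or without quotes. A quick check, rebuilding a fragment of the pollution data so the chunk stands alone:

```r
library(tidyr)
library(tibble)

# A fragment of the pollution dataset from above
pollution <- tribble(
  ~city,      ~size,   ~amount,
  "New York", "large",      23,
  "New York", "small",      14,
  "London",   "large",      22,
  "London",   "small",      16
)

# Unquoted and quoted column names give the same result
a <- spread(pollution, key = size, value = amount)
b <- spread(pollution, key = "size", value = "amount")
identical(a, b)  # TRUE
```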
Let’s apply spread() to a real world inquiry. The plot below visualizes an aspect of the babynames data set from the babynames package. (See Work with Data for an introduction to the babynames data set.)
The ratio of girls to boys in babynames is not constant across time. We can explore this phenomenon further by recreating the data in the plot.
image
To make the data displayed in the plot above, I first grouped babynames by year and sex. Then I computed a summary for each group: total, which is equal to the sum of n for each group.
Use dplyr functions to recreate this process in the chunk below.
babynames %>%
group_by(year, sex) %>%
summarize(total = sum(n))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 276 × 3
## year sex total
## <dbl> <chr> <int>
## 1 1880 F 90993
## 2 1880 M 110491
## 3 1881 F 91953
## 4 1881 M 100743
## 5 1882 F 107847
## 6 1882 M 113686
## 7 1883 F 112319
## 8 1883 M 104627
## 9 1884 F 129020
## 10 1884 M 114442
## # … with 266 more rows
image
babynames %>%
group_by(year, sex) %>%
summarise(total = sum(n)) %>%
ggplot() +
geom_line(mapping = aes(x = year, y = total, color = sex))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
A better way to explore this phenomenon would be to directly plot a ratio of boys to girls over time. To make such a plot, you would need to compute the ratio of boys to girls for each year from 1880 to 2015:
\[\mbox{ratio male} = \frac{\mbox{total male}}{\mbox{total female}}\]
But how can we plot this data? Our current iteration of babynames places the total number of boys and girls for each year in the same column, which makes it hard to use both totals in the same calculation.
babynames %>%
group_by(year, sex) %>%
summarise(total = sum(n))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 276 × 3
## year sex total
## <dbl> <chr> <int>
## 1 1880 F 90993
## 2 1880 M 110491
## 3 1881 F 91953
## 4 1881 M 100743
## 5 1882 F 107847
## 6 1882 M 113686
## 7 1883 F 112319
## 8 1883 M 104627
## 9 1884 F 129020
## 10 1884 M 114442
## # … with 266 more rows
It would be easier to calculate the ratio of boys to girls if we could reshape our data to place the total number of boys born per year in one column and the total number of girls born per year in another:
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 138 × 3
## year F M
## <dbl> <int> <int>
## 1 1880 90993 110491
## 2 1881 91953 100743
## 3 1882 107847 113686
## 4 1883 112319 104627
## 5 1884 129020 114442
## 6 1885 133055 107799
## 7 1886 144533 110784
## 8 1887 145981 101413
## 9 1888 178622 120851
## 10 1889 178366 110580
## # … with 128 more rows
Then we could compute the ratio by piping our data into a call like mutate(ratio = M / F).
Add to the code below to: 1) reshape the layout with spread() so that the male and female totals appear in their own columns, 2) compute ratio with mutate(), and 3) plot ratio against year.
babynames %>%
group_by(year, sex) %>%
summarise(total = sum(n)) %>%
spread(key = sex, value = total) %>%
mutate(ratio = M / F) %>%
ggplot() +
geom_line(mapping = aes(x = year, y = ratio))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
Our results reveal a conspicuous oddity that is easier to interpret if we turn the ratio into a percentage.
The percent of recorded male births is unusually low between 1880 and 1936. What is happening? One insight is that the data comes from the United States Social Security office, which was only created in 1936. As a result, we can expect the data prior to 1936 to display a survivorship bias.
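To recreate the percent version of the plot, one approach (a sketch assuming the babynames package, as in the earlier chunks) is to compute the male share of all recorded births:

```r
library(tidyverse)
library(babynames)

# Percent of recorded births each year that are male
babynames %>%
  group_by(year, sex) %>%
  summarise(total = sum(n), .groups = "drop") %>%
  spread(key = sex, value = total) %>%
  mutate(percent_male = M / (M + F) * 100) %>%
  ggplot() +
    geom_line(mapping = aes(x = year, y = percent_male)) +
    labs(y = "Percent of recorded births that are male")
```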
Your data will be easier to work with in R if you reshape it into a tidy layout at the start of your analysis. Data is tidy if: 1) each variable is in its own column, 2) each observation is in its own row, and 3) each value is in its own cell.
You can use gather() and spread(), or some iterative sequence of the two, to reshape your data into any configuration that preserves the values and relationships in the original data set. In particular, you can use these functions to recast your data into a tidy layout.
It is not always clear whether or not a data set is tidy. For example, the version of babynames that was tidy when we wanted to plot total children by year was no longer tidy when we wanted to compute the ratio of male to female children.
The ambiguity comes from the definition of tidy data. Tidiness depends on the variables in your data set, but what counts as a variable depends on what you are trying to do.
To identify the variables that you need to work with, describe what you want to do with an equation. Each variable in the equation should correspond to a variable in your data.
So in our first case, we wanted to make a plot with the following mappings (e.g. equations)
\[x = \mbox{year}\] \[y=\mbox{total}\] \[\mbox{color}=\mbox{sex}\]
To do this, we needed a data set that placed year, total, and sex each in their own columns.
In our second case we wanted to compute ratio male, where
\[\mbox{ratio male}=\frac{\mbox{total male}}{\mbox{total female}}\]
This formula has three variables: ratio male, total male, and total female. To create the first variable, we required a data set that isolated the second and third variables (total male and total female) in their own columns.
Here you will learn to separate a column into multiple columns and to reverse the process by uniting multiple columns into a single column. Then you’ll practice your data wrangling skills on messy real world data.
Data is easiest to analyze in R when it is stored in a tidy format. In the last tutorial, you learned how to tidy data that has an untidy layout, but there is another way that data sets can be untidy: a data set can combine multiple values in a single cell or spread a single value across multiple cells. This makes it difficult to extract and use values in your analysis.
This tutorial will teach you two tools that you can use to tidy this type of data:
It ends with a case study that requires you to use all of the tidy tools to wrangle a messy real world data set.
This tutorial uses the core tidyverse packages, including tidyr. All of these packages have been pre-installed and pre-loaded for your convenience.
Click the Next Topic button to begin.
The hurricanes data set contains historical information about five hurricanes. At first glance it appears to contain four variables: name, wind_speed, pressure, and date. However, there are three more variables hidden in plain sight. Can you spot them?
## # A tibble: 6 × 4
## name wind_speed pressure date
## <chr> <dbl> <dbl> <chr>
## 1 Alberto 110 1007 2000-08-03
## 2 Alex 45 1009 1998-07-27
## 3 Allison 65 1005 1995-06-03
## 4 Ana 40 1013 1997-06-30
## 5 Arlene 50 1010 1999-06-11
## 6 Arthur 45 1010 1996-06-17
Which variables are “hidden” in hurricanes? Check three.
Good job! The date variable also displays the year, month, and day associated with each measurement.
Did you realize that dates are a combination of multiple variables? They are.
You’ll almost always display these variables together to make a date, because a date is itself a variable—one that conveys more than the sum of its parts.
However, there are times where it is convenient to treat each element of a date separately. For example, what if you wanted to filter hurricanes to just the storms that occurred in June (i.e. month == 6)? Then it would be convenient to reorganize the data to look like this.
## # A tibble: 6 × 6
## name wind_speed pressure year month day
## <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 Alberto 110 1007 2000 08 03
## 2 Alex 45 1009 1998 07 27
## 3 Allison 65 1005 1995 06 03
## 4 Ana 40 1013 1997 06 30
## 5 Arlene 50 1010 1999 06 11
## 6 Arthur 45 1010 1996 06 17
But how could you do it?
You can separate the elements of date with the separate() function. separate() divides a column of values into multiple columns that each contain a portion of the original values.
Run the code below to see separate() in action. Then click continue to learn about the syntax.
hurricanes %>%
separate(col = date, into = c("year","month","day"), sep = "-")
## # A tibble: 6 × 6
## name wind_speed pressure year month day
## <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 Alberto 110 1007 2000 08 03
## 2 Alex 45 1009 1998 07 27
## 3 Allison 65 1005 1995 06 03
## 4 Ana 40 1013 1997 06 30
## 5 Arlene 50 1010 1999 06 11
## 6 Arthur 45 1010 1996 06 17
Good job! As with other tidyverse functions, separate() returns a modified copy of the original data. You will need to save the copy if you wish to use it later.
Let’s rewrite our above command without the pipe, to make the syntax of separate() easier to see.
separate(hurricanes, col = date, into = c("year","month","day"), sep = "-")
separate() takes a data frame and then the name of a column in the data frame to separate. Here our code will separate the date column of the hurricane data set.
The sep = "-" argument tells separate() to split each value in date wherever a - appears. You can choose to split on any character or character string.
Separating on - will split each date into three parts: a year, a month, and a day. As a result, separate() will need to add three new columns to the result. The into argument gives separate() a character vector of names to use for the new columns. Since the result will have three new columns, this vector will need to contain three names. separate() will return an error message if it ends up creating fewer or more columns than it has column names.
By default, separate() will separate values at the location of any non-alphanumeric character, like -, commas, and /. So, for example, we could run our code without the sep = "-" argument and, in this case, get the same result.
Or will we? Do a quick mental check and then run the code to see if you are right.
hurricanes %>%
separate(col = date, into = c("year","month","day"))
## # A tibble: 6 × 6
## name wind_speed pressure year month day
## <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 Alberto 110 1007 2000 08 03
## 2 Alex 45 1009 1998 07 27
## 3 Allison 65 1005 1995 06 03
## 4 Ana 40 1013 1997 06 30
## 5 Arlene 50 1010 1999 06 11
## 6 Arthur 45 1010 1996 06 17
Good job! "-" is the only non-alphanumeric character used in our dates, which means that the defaults return the same output as setting sep = "-".
If you set sep equal to an integer, separate() will split the values at the position indicated by the integer. For example, sep = 2 splits each value between its second and third characters.
Think you have it? Create this version of hurricanes by adding a second call to separate() that uses an integer separator to the code below:
## # A tibble: 6 × 7
## name wind_speed pressure century year month day
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 Alberto 110 1007 20 00 08 03
## 2 Alex 45 1009 19 98 07 27
## 3 Allison 65 1005 19 95 06 03
## 4 Ana 40 1013 19 97 06 30
## 5 Arlene 50 1010 19 99 06 11
## 6 Arthur 45 1010 19 96 06 17
hurricanes %>%
separate(col = date, into = c("year","month","day")) %>%
separate(col = year, into = c("century", "year"), sep = 2)
## # A tibble: 6 × 7
## name wind_speed pressure century year month day
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 Alberto 110 1007 20 00 08 03
## 2 Alex 45 1009 19 98 07 27
## 3 Allison 65 1005 19 95 06 03
## 4 Ana 40 1013 19 97 06 30
## 5 Arlene 50 1010 19 99 06 11
## 6 Arthur 45 1010 19 96 06 17
Would these two commands return the same result? Why or why not? Once you have an answer, run the code below to see if you were right.
hurricanes %>%
separate(col = pressure, into = c("first", "last"), sep = 1)
## # A tibble: 6 × 5
## name wind_speed first last date
## <chr> <dbl> <chr> <chr> <chr>
## 1 Alberto 110 1 007 2000-08-03
## 2 Alex 45 1 009 1998-07-27
## 3 Allison 65 1 005 1995-06-03
## 4 Ana 40 1 013 1997-06-30
## 5 Arlene 50 1 010 1999-06-11
## 6 Arthur 45 1 010 1996-06-17
When sep = 1, separate() splits after the first character.
hurricanes %>%
separate(col = pressure, into = c("first", "last"), sep = "1")
## Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [4, 5, 6].
## # A tibble: 6 × 5
## name wind_speed first last date
## <chr> <dbl> <chr> <chr> <chr>
## 1 Alberto 110 "" 007 2000-08-03
## 2 Alex 45 "" 009 1998-07-27
## 3 Allison 65 "" 005 1995-06-03
## 4 Ana 40 "" 0 1997-06-30
## 5 Arlene 50 "" 0 1999-06-11
## 6 Arthur 45 "" 0 1996-06-17
When sep = "1", separate() splits at every appearance of the character "1". This happens because R treats a 1 surrounded by quotation marks as a character string, not a number.
You may have noticed that separate() returns its results as columns of character strings. However, in some cases, like ours, the columns will contain integers, doubles, or other types of non-character data.
You can ask separate() to convert the new columns to an appropriate data type by adding convert = TRUE to your separate() call. This is identical to the convert = TRUE argument of gather().
Identify the data types of year, month, and day (they appear under the column names) in the output below. Compare them to the character columns that separate() returned earlier: convert = TRUE is what changes them.
hurricanes %>%
separate(col = date, into = c("year","month","day"), convert = TRUE)
## # A tibble: 6 × 6
## name wind_speed pressure year month day
## <chr> <dbl> <dbl> <int> <int> <int>
## 1 Alberto 110 1007 2000 8 3
## 2 Alex 45 1009 1998 7 27
## 3 Allison 65 1005 1995 6 3
## 4 Ana 40 1013 1997 6 30
## 5 Arlene 50 1010 1999 6 11
## 6 Arthur 45 1010 1996 6 17
Let’s take a look at one last argument for separate(). If you add remove = FALSE to your separate() call, R will retain the original column in the results.
hurricanes %>%
separate(col = date, into = c("year","month","day"), convert = TRUE, remove = FALSE)
## # A tibble: 6 × 7
## name wind_speed pressure date year month day
## <chr> <dbl> <dbl> <chr> <int> <int> <int>
## 1 Alberto 110 1007 2000-08-03 2000 8 3
## 2 Alex 45 1009 1998-07-27 1998 7 27
## 3 Allison 65 1005 1995-06-03 1995 6 3
## 4 Ana 40 1013 1997-06-30 1997 6 30
## 5 Arlene 50 1010 1999-06-11 1999 6 11
## 6 Arthur 45 1010 1996-06-17 1996 6 17
You can do the inverse of separate() with unite(). unite() uses multiple input columns to create a single output column. It builds this column by pasting together the cells of the input column with a separator.
hurricanes %>%
separate(date, c("year", "month", "day"), sep = "-") %>%
unite(col = "date", month, day, year, sep = ":")
## # A tibble: 6 × 4
## name wind_speed pressure date
## <chr> <dbl> <dbl> <chr>
## 1 Alberto 110 1007 08:03:2000
## 2 Alex 45 1009 07:27:1998
## 3 Allison 65 1005 06:03:1995
## 4 Ana 40 1013 06:30:1997
## 5 Arlene 50 1010 06:11:1999
## 6 Arthur 45 1010 06:17:1996
Notice that the syntax of unite() is the inverse of separate(): you list the existing columns to combine, name the new column with col, and choose the separator to place between the pasted values with sep.
Use separate() and unite() to rewrite the dates in hurricanes in the format below:
hurricanes %>% separate(date, c("year", "month", "day"), sep = "-") %>%
unite(col = date, "month", "day", "year", sep = "/")
## # A tibble: 6 × 4
## name wind_speed pressure date
## <chr> <dbl> <dbl> <chr>
## 1 Alberto 110 1007 08/03/2000
## 2 Alex 45 1009 07/27/1998
## 3 Allison 65 1005 06/03/1995
## 4 Ana 40 1013 06/30/1997
## 5 Arlene 50 1010 06/11/1999
## 6 Arthur 45 1010 06/17/1996
Good job! Let's push it one step further.
Use the chunk below to: 1) separate the century from the rest of each date, 2) filter to just the storms that occurred in the 1900s, and 3) unite the century and the rest of the date back into a single date column.
hurricanes %>%
separate(col = date, into = c("century", "rest"), sep = 2) %>%
filter(century == 19) %>%
unite(col = "date", century, rest, sep = "")
## # A tibble: 5 × 4
## name wind_speed pressure date
## <chr> <dbl> <dbl> <chr>
## 1 Alex 45 1009 1998-07-27
## 2 Allison 65 1005 1995-06-03
## 3 Ana 40 1013 1997-06-30
## 4 Arlene 50 1010 1999-06-11
## 5 Arthur 45 1010 1996-06-17
hurricanes %>%
separate(col = date, into = c("century", "rest"), sep = 2) %>%
filter(century == "19") %>%
unite(col = "date", century, rest, sep = "")
## # A tibble: 5 × 4
## name wind_speed pressure date
## <chr> <dbl> <dbl> <chr>
## 1 Alex 45 1009 1998-07-27
## 2 Allison 65 1005 1995-06-03
## 3 Ana 40 1013 1997-06-30
## 4 Arlene 50 1010 1999-06-11
## 5 Arthur 45 1010 1996-06-17
So far we’ve separated and united date, a variable that legitimately contains sub-variables. In general, it makes little sense to combine unrelated values within the same cell, yet many data sets follow this senseless practice. If you inherit one, you can use separate() and unite() to reorganize the values in a tidy fashion.
In the case study that follows, you will do just that. You will also practice using all of the tidyr functions as you do.
The who data set contains a subset of data from the World Health Organization Global Tuberculosis Report, available here.
_Probably: https://extranet.who.int/tme/generateCSV.asp?ds=notifications_
In its original format, the data is very untidy
who_orig <- read_csv("who/who_TB_notification.csv")
## Rows: 8492 Columns: 177
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): country, iso2, iso3, iso_numeric, g_whoregion
## dbl (172): year, new_sp, new_sn, new_su, new_ep, new_oth, ret_rel, ret_taf, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 8,492 × 177
## country iso2 iso3 iso_n…¹ g_who…² year new_sp new_sn new_su new_ep new_oth
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan… AF AFG 004 EMR 1980 NA NA NA NA NA
## 2 Afghan… AF AFG 004 EMR 1981 NA NA NA NA NA
## 3 Afghan… AF AFG 004 EMR 1982 NA NA NA NA NA
## 4 Afghan… AF AFG 004 EMR 1983 NA NA NA NA NA
## 5 Afghan… AF AFG 004 EMR 1984 NA NA NA NA NA
## 6 Afghan… AF AFG 004 EMR 1985 NA NA NA NA NA
## 7 Afghan… AF AFG 004 EMR 1986 NA NA NA NA NA
## 8 Afghan… AF AFG 004 EMR 1987 NA NA NA NA NA
## 9 Afghan… AF AFG 004 EMR 1988 NA NA NA NA NA
## 10 Afghan… AF AFG 004 EMR 1989 NA NA NA NA NA
## # … with 8,482 more rows, 166 more variables: ret_rel <dbl>, ret_taf <dbl>,
## # ret_tad <dbl>, ret_oth <dbl>, newret_oth <dbl>, new_labconf <dbl>,
## # new_clindx <dbl>, ret_rel_labconf <dbl>, ret_rel_clindx <dbl>,
## # ret_rel_ep <dbl>, ret_nrel <dbl>, notif_foreign <dbl>, c_newinc <dbl>,
## # new_sp_m04 <dbl>, new_sp_m514 <dbl>, new_sp_m014 <dbl>, new_sp_m1524 <dbl>,
## # new_sp_m2534 <dbl>, new_sp_m3544 <dbl>, new_sp_m4554 <dbl>,
## # new_sp_m5564 <dbl>, new_sp_m65 <dbl>, new_sp_mu <dbl>, new_sp_f04 <dbl>, …
who
## # A tibble: 1,000 × 103
## country iso2 iso3 year new_s…¹ new_s…² new_s…³ new_s…⁴ new_s…⁵ new_s…⁶
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan AF AFG 1980 NA NA NA NA NA NA
## 2 Afghanistan AF AFG 1981 NA NA NA NA NA NA
## 3 Afghanistan AF AFG 1982 NA NA NA NA NA NA
## 4 Afghanistan AF AFG 1983 NA NA NA NA NA NA
## 5 Afghanistan AF AFG 1984 NA NA NA NA NA NA
## 6 Afghanistan AF AFG 1985 NA NA NA NA NA NA
## 7 Afghanistan AF AFG 1986 NA NA NA NA NA NA
## 8 Afghanistan AF AFG 1987 NA NA NA NA NA NA
## 9 Afghanistan AF AFG 1988 NA NA NA NA NA NA
## 10 Afghanistan AF AFG 1989 NA NA NA NA NA NA
## # … with 990 more rows, 93 more variables: new_sp_m65 <dbl>, new_sp_mu <dbl>,
## # new_sp_f04 <dbl>, new_sp_f514 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
## # new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
## # new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sp_fu <dbl>, new_sn_m04 <dbl>,
## # new_sn_m514 <dbl>, new_sn_m014 <dbl>, new_sn_m1524 <dbl>,
## # new_sn_m2534 <dbl>, new_sn_m3544 <dbl>, new_sn_m4554 <dbl>,
## # new_sn_m5564 <dbl>, new_sn_m65 <dbl>, new_sn_m15plus <dbl>, …
The first four columns of who (country, iso2, iso3, and year) each contain a single variable.
The remaining columns are named after codes that contain multiple variables.
Each column name after the fourth contains a code comprised of three values from three variables: type of TB, gender, and age.
[Diagram: how each column name encodes the type of TB, sex, and age group]
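As a sketch of how one of these codes unpacks, here is a hypothetical one-row tibble (not the real who data) split apart with separate():

```r
library(tidyverse)

# A toy one-row tibble, used only to show how a single code decomposes
tibble(code = "new_sp_m014") %>%
  separate(code, into = c("new", "type", "sexage"), sep = "_") %>%
  separate(sexage, into = c("sex", "age"), sep = 1)
# Yields new = "new", type = "sp", sex = "m", age = "014"
```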
To make who easier to use in R, we should tidy it into the format below. This data set contains six non-redundant variables: country, year, type, sex, age (group), and n (the number of cases of TB reported for each group).
## # A tibble: 12,809 × 6
## country year type sex age n
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Afghanistan 1997 sp m 014 0
## 2 Afghanistan 1998 sp m 014 30
## 3 Afghanistan 1999 sp m 014 8
## 4 Afghanistan 2000 sp m 014 52
## 5 Afghanistan 2001 sp m 014 129
## 6 Afghanistan 2002 sp m 014 90
## 7 Afghanistan 2003 sp m 014 127
## 8 Afghanistan 2004 sp m 014 139
## 9 Afghanistan 2005 sp m 014 151
## 10 Afghanistan 2006 sp m 014 193
## # … with 12,799 more rows
who %>% select(-c(iso2, iso3))
## # A tibble: 1,000 × 101
## country year new_s…¹ new_s…² new_s…³ new_s…⁴ new_s…⁵ new_s…⁶ new_s…⁷ new_s…⁸
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan… 1980 NA NA NA NA NA NA NA NA
## 2 Afghan… 1981 NA NA NA NA NA NA NA NA
## 3 Afghan… 1982 NA NA NA NA NA NA NA NA
## 4 Afghan… 1983 NA NA NA NA NA NA NA NA
## 5 Afghan… 1984 NA NA NA NA NA NA NA NA
## 6 Afghan… 1985 NA NA NA NA NA NA NA NA
## 7 Afghan… 1986 NA NA NA NA NA NA NA NA
## 8 Afghan… 1987 NA NA NA NA NA NA NA NA
## 9 Afghan… 1988 NA NA NA NA NA NA NA NA
## 10 Afghan… 1989 NA NA NA NA NA NA NA NA
## # … with 990 more rows, 91 more variables: new_sp_f04 <dbl>, new_sp_f514 <dbl>,
## # new_sp_f014 <dbl>, new_sp_f1524 <dbl>, new_sp_f2534 <dbl>,
## # new_sp_f3544 <dbl>, new_sp_f4554 <dbl>, new_sp_f5564 <dbl>,
## # new_sp_f65 <dbl>, new_sp_fu <dbl>, new_sn_m04 <dbl>, new_sn_m514 <dbl>,
## # new_sn_m014 <dbl>, new_sn_m1524 <dbl>, new_sn_m2534 <dbl>,
## # new_sn_m3544 <dbl>, new_sn_m4554 <dbl>, new_sn_m5564 <dbl>,
## # new_sn_m65 <dbl>, new_sn_m15plus <dbl>, new_sn_mu <dbl>, …
Next, we need to move the type, sex, and age variables out of the column names and into a column of their own. It is true that we want to separate these values into their own cells, but that will be easier to do once they are in their own column.
In short, we want to do something like this:
[Diagram: gathering the column names into a single codes column, with the cell values in an n column]
Add to the pipe below. Use a tidyr reshaping function to gather the column names into their own column, named “codes”. Place the column cells into a column named “n”. Hint: it may be helpful to know that there are now 101 columns in the data set.
You can think of each column name as a key that combines the values of several variables. We want to move those keys into their own key column.
who %>%
select(-iso2, -iso3) %>%
gather(key = "codes", value = "n", new_sp_m014:new_rel_f65)
## # A tibble: 99,000 × 4
## country year codes n
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1980 new_sp_m014 NA
## 2 Afghanistan 1981 new_sp_m014 NA
## 3 Afghanistan 1982 new_sp_m014 NA
## 4 Afghanistan 1983 new_sp_m014 NA
## 5 Afghanistan 1984 new_sp_m014 NA
## 6 Afghanistan 1985 new_sp_m014 NA
## 7 Afghanistan 1986 new_sp_m014 NA
## 8 Afghanistan 1987 new_sp_m014 NA
## 9 Afghanistan 1988 new_sp_m014 NA
## 10 Afghanistan 1989 new_sp_m014 NA
## # … with 98,990 more rows
Our last separate() isolated two components of the who codes: new and type. However, it did not separate the sex and age variables.
If you look closely at the structure of the sexage column, you will see that each cell begins with a single letter that represents a gender, m or f, and is then followed by three or more numbers, which represent an age group. Use this insight to perform a second separate that isolates the “sex” and “age” variables:
who %>%
select(-iso2, -iso3) %>%
gather(key = "codes", value = "n", new_sp_m014:new_rel_f65) %>%
separate(codes, into = c("new", "type", "sexage"), sep = "_")
## # A tibble: 99,000 × 6
## country year new type sexage n
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Afghanistan 1980 new sp m014 NA
## 2 Afghanistan 1981 new sp m014 NA
## 3 Afghanistan 1982 new sp m014 NA
## 4 Afghanistan 1983 new sp m014 NA
## 5 Afghanistan 1984 new sp m014 NA
## 6 Afghanistan 1985 new sp m014 NA
## 7 Afghanistan 1986 new sp m014 NA
## 8 Afghanistan 1987 new sp m014 NA
## 9 Afghanistan 1988 new sp m014 NA
## 10 Afghanistan 1989 new sp m014 NA
## # … with 98,990 more rows
Add to the pipe to remove the new variable, which doesn’t provide any useful information. (Every row in the data set shows new cases of TB and has the same value of new).
who %>%
select(-iso2, -iso3) %>%
gather(key = "codes", value = "n", new_sp_m014:new_rel_f65) %>%
separate(codes, into = c("new", "type", "sexage"), sep = "_") %>%
separate(sexage, into = c("sex", "age"), sep = 1)
## # A tibble: 99,000 × 7
## country year new type sex age n
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Afghanistan 1980 new sp m 014 NA
## 2 Afghanistan 1981 new sp m 014 NA
## 3 Afghanistan 1982 new sp m 014 NA
## 4 Afghanistan 1983 new sp m 014 NA
## 5 Afghanistan 1984 new sp m 014 NA
## 6 Afghanistan 1985 new sp m 014 NA
## 7 Afghanistan 1986 new sp m 014 NA
## 8 Afghanistan 1987 new sp m 014 NA
## 9 Afghanistan 1988 new sp m 014 NA
## 10 Afghanistan 1989 new sp m 014 NA
## # … with 98,990 more rows
Notice that the n column of who contains the most insightful information. You do not need to take any measurements to list out the country, year, type, sex, and age combinations in the data set. In a sense, you know these combinations in advance. However, n shows how many cases of TB were reported for each combination. You do not know this information in advance, and you can only acquire it through field work—yours or someone else’s. As a result, it is concerning that our data contains so many NAs for n.
NA is R’s symbol for missing information, and it is common to have multiple NAs when you reshape your data from a wide format to a long format. The rectangular table structure imposed by wide data requires a placeholder for every combination of variable values—even if no data was collected for that combination.
In contrast, the long data format does not require a placeholder for each combination of variable values. Since each combination is saved as its own row, you can simply omit rows that contain an NA.
The tidyr package provides a convenient function for dropping rows that contain an NA in a specific column. The function is drop_na(). To use it, give drop_na() a data set (perhaps via a pipe), then list one or more columns in that data set, e.g.
data %>% drop_na(column1, column2)
drop_na() will drop every row that contains an NA in one or more of the listed columns.
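Here is a minimal sketch of that behavior on a toy tibble (not part of the tutorial data):

```r
library(tidyverse)

# A toy tibble with one missing value in n
toy <- tibble(name = c("a", "b", "c"), n = c(1, NA, 3))

toy %>% drop_na(n)
# Keeps the "a" and "c" rows; the "b" row is dropped because its n is NA
```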
Add drop_na() to the pipe below to drop every row that has an NA in the n column.
who %>%
select(-iso2, -iso3) %>%
gather(key = "codes", value = "n", new_sp_m014:new_rel_f65) %>%
separate(codes, into = c("new", "type", "sexage"), sep = "_") %>%
separate(sexage, into = c("sex", "age"), sep = 1) %>%
select(-new)
## # A tibble: 99,000 × 6
## country year type sex age n
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Afghanistan 1980 sp m 014 NA
## 2 Afghanistan 1981 sp m 014 NA
## 3 Afghanistan 1982 sp m 014 NA
## 4 Afghanistan 1983 sp m 014 NA
## 5 Afghanistan 1984 sp m 014 NA
## 6 Afghanistan 1985 sp m 014 NA
## 7 Afghanistan 1986 sp m 014 NA
## 8 Afghanistan 1987 sp m 014 NA
## 9 Afghanistan 1988 sp m 014 NA
## 10 Afghanistan 1989 sp m 014 NA
## # … with 98,990 more rows
who %>%
select(-iso2, -iso3) %>%
gather(key = "codes", value = "n", new_sp_m014:new_rel_f65) %>%
separate(codes, into = c("new", "type", "sexage"), sep = "_") %>%
separate(sexage, into = c("sex", "age"), sep = 1) %>%
select(-new) %>%
drop_na(n)
## # A tibble: 12,809 × 6
## country year type sex age n
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Afghanistan 1997 sp m 014 0
## 2 Afghanistan 1998 sp m 014 30
## 3 Afghanistan 1999 sp m 014 8
## 4 Afghanistan 2000 sp m 014 52
## 5 Afghanistan 2001 sp m 014 129
## 6 Afghanistan 2002 sp m 014 90
## 7 Afghanistan 2003 sp m 014 127
## 8 Afghanistan 2004 sp m 014 139
## 9 Afghanistan 2005 sp m 014 151
## 10 Afghanistan 2006 sp m 014 193
## # … with 12,799 more rows
Good job! You’ve wrangled who into a tidy, polished data set that is ready to be explored, modelled, and analyzed.
The difference between the initial and final versions of who is drastic, but each step in our pipe imposed a small, logical change. This is by design.
The tidyverse contains a vocabulary of functions that each do one simple thing, but can be combined to do more sophisticated tasks. In this way, the tidyverse is like a written language: it is made up of words (functions) that can be combined into sentences with sophisticated meanings (pipes).
This approach also makes it easier to solve problems with code. You can approach any problem by decomposing it into a series of small, simple steps.
Complete your data wrangling education by learning to work with relational data. Here you will learn how to augment data sets with information from related data sets, as well as how to filter one data set against another.
Data often comes as multiple data sets that are related to each other. When this happens, the data will be easier to analyze if you join the data sets into a single table. This tutorial will teach you several functions that join data sets together. These functions do something sophisticated: they match rows from one data set to corresponding rows in another data set, even if the rows appear in a different order. The functions are left_join(), right_join(), full_join(), inner_join(), semi_join(), and anti_join().
Each of these functions comes in the dplyr package, not the tidyr package. You may wonder why we are learning about them in the Tidy Data primer. Joins are a useful component of data tidying; your data can hardly be tidy if observations are split across multiple data frames where they are listed in different orders.
This tutorial uses the core tidyverse packages, including dplyr, as well as the nycflights13 package. All of these packages have been pre-installed and pre-loaded for your convenience.
Click the Next Topic button to begin.
Flight delays are an unfortunate aspect of air travel. If you’ve flown more than a handful of times, you’ve probably experienced a delayed flight, which may make you wonder: is it possible to predict which flights will be delayed?
The flights data set in the nycflights13 package provides some relevant information. It contains details of every flight that departed from an airport that serves New York City in 2013. Let’s use it to explore which airlines have the largest flight delays.
flights
## # A tibble: 336,776 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
The carrier variable of flights uses a carrier code to identify which airline operated each flight. This gives us a strategy for comparing the average delay time by airline: group the flights by carrier, compute the average arrival delay for each group, and then arrange the groups by their average delay.
Use dplyr functions in the code chunk below to enact this strategy. Which airlines have the largest average delays?
flights %>%
drop_na(arr_delay) %>%
group_by(carrier) %>%
summarise(avg_delay = mean(arr_delay)) %>%
arrange(desc(avg_delay))
## # A tibble: 16 × 2
## carrier avg_delay
## <chr> <dbl>
## 1 F9 21.9
## 2 FL 20.1
## 3 EV 15.8
## 4 YV 15.6
## 5 OO 11.9
## 6 MQ 10.8
## 7 WN 9.65
## 8 B6 9.46
## 9 9E 7.38
## 10 UA 3.56
## 11 US 2.13
## 12 VX 1.76
## 13 DL 1.64
## 14 AA 0.364
## 15 HA -6.92
## 16 AS -9.93
"Good job! You've calculated the average delay per airline, but the results are difficult to interpret. We don't know which codes are associated with which airlines."
Our results show that the carrier F9 had the worst record for delays in the New York City area in 2013. But unless you are an air traffic controller, you probably do not know which airline has the carrier code F9.
Luckily, the nycflights13 package comes with another data set, airlines, which matches the name of each airline to its carrier code.
airlines
## # A tibble: 16 × 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
While you could look up F9 manually in airlines, and then repeat that process for every other code, the task would not be enjoyable. Your boss or your client will probably be even less willing to do it than you are.
A better solution would be to join the airlines data set to your results programmatically. In other words, to instruct R to add the name that is associated with each carrier code in airlines to the row that is associated with each carrier code in your results.
This is easy to do with one of dplyr’s four join functions: left_join(), right_join(), full_join(), and inner_join(). Each performs a variation of the basic task above.
The easiest way to learn how join functions work is visually. To this end, I’ve created some small toy data sets that we can visualize in their entirety: band and instrument, which look like this:
[Diagram: the band and instrument data sets]
Notice that each data set has a column named name. Also, notice that each data set contains a row about John and a row about Paul. If you know a little about The Beatles, you’ll recognize that these rows match: they describe the same people. On the other hand, the rows named Mick and Keith do not match any rows in the other data set. Finally, notice that the matching rows do not appear in the same place in each data set. For example, John is in the second row of band, but the first row of instrument.
These small data sets do a good job of matching the haphazard nature of real data. Our job will be to join them into a single data set that correctly matches the John and Paul rows to each other.
If you wish to see the raw data in band and instrument, take a peek by running the code below.
band
## # A tibble: 3 × 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
instrument
## # A tibble: 3 × 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
Let’s look at each dplyr join function and then deconstruct their syntax.
The left_join() function returns a copy of a data set that is augmented with information from a second data set. It retains all of the rows of the first data set, and only adds rows from the second data set that match rows in the first.
So here, Mick is retained in the result (with an NA in the appropriate spot) because Mick appears in the first data set. On the other hand, Keith does not appear in the result because Keith does not appear in the first data set.
To see what this result looks like in R, run the code below.
band %>% left_join(instrument, by = "name")
## # A tibble: 3 × 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
right_join() does the opposite of left_join(); it retains every row from the second data set and only adds rows from the first data set that have a match in the second data set. Now Keith appears in the result because Keith appears in the second data set. On the other hand, Mick does not appear in the result because he does not appear in the second data set.
[Diagram: a right join of band and instrument]
You can think of left_join() as prioritizing the first data set, and right_join() as prioritizing the second. To see the results in R, run the code below.
band %>% right_join(instrument, by = "name")
## # A tibble: 3 × 3
## name band plays
## <chr> <chr> <chr>
## 1 John Beatles guitar
## 2 Paul Beatles bass
## 3 Keith <NA> guitar
How can you swap the names in the code below to attain the results pictured in the right join diagram? (Don’t worry about the order of the columns in the result.)
[Diagram: the desired right join result]
band %>% left_join(instrument, by = "name")
band %>% right_join(instrument, by = "name")
## # A tibble: 3 × 3
## name band plays
## <chr> <chr> <chr>
## 1 John Beatles guitar
## 2 Paul Beatles bass
## 3 Keith <NA> guitar
"Good Job! Since right and left joins are analagous, you can acheive the same results by switching the order of the data sets in a left join. Notice that this will affect the column order."
A full_join() is more inclusive than either a right_join() or a left_join(). A full_join() retains every row from each data set, inserting NA placeholders throughout the results as necessary.
This is the only join that does not lose any information from the original data sets. Both Mick and Keith appear in the results.
To see what this result looks like in R, run the code below.
band %>% full_join(instrument, by = "name")
## # A tibble: 4 × 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
## 4 Keith <NA> guitar
In contrast, an inner_join() is the most exclusive join. It only retains rows that appear in both data sets. As a result, only John and Paul appear in the result. Mick and Keith are left behind.
[Diagram: an inner join of band and instrument]
To see what this result looks like in R, run the code below.
band %>% inner_join(instrument, by = "name")
## # A tibble: 2 × 3
## name band plays
## <chr> <chr> <chr>
## 1 John Beatles guitar
## 2 Paul Beatles bass
These four joins, left_join(), right_join(), full_join(), and inner_join(), are called mutating joins because they each return a copy of a data set that has been augmented with new information, just as mutate() returns a copy of a data set that has been augmented with new information.
Each function uses the same syntax:
left_join(band, instrument, by = "name")
right_join(band, instrument, by = "name")
full_join(band, instrument, by = "name")
inner_join(band, instrument, by = "name")
First, pass the function the names of two data sets to join.
Then set the by argument to the name or names of the column or columns to join on. These names should be passed as a vector of character strings, i.e. characters surrounded by quotes. In the code above, we join on a single column so our vector of strings simplifies to a single string, but you could imagine doing something like left_join(band, instrument, by = c("first", "last")).
Each column name in by should appear in both data sets. The join function will match together rows that have identical combinations of values in the columns listed in by. If you do not specify a by argument, dplyr will join on the set of all column names that appear in both data sets.
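If you omit by entirely, dplyr falls back on the shared column names, as this sketch with the toy data sets shows (the exact message dplyr prints varies by version):

```r
# With no `by` argument, dplyr joins on every column name that the two
# data sets share (here just "name") and prints a message saying so
band %>% left_join(instrument)
```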
Now that you’ve familiarized yourself with the mutating join functions, let’s use one to finish our airlines query. Add two more lines to the code below.
flights %>%
drop_na(arr_delay) %>%
group_by(carrier) %>%
summarise(avg_delay = mean(arr_delay)) %>%
arrange(desc(avg_delay)) %>%
left_join(airlines, by = "carrier") %>%
select(name, avg_delay)
## # A tibble: 16 × 2
## name avg_delay
## <chr> <dbl>
## 1 Frontier Airlines Inc. 21.9
## 2 AirTran Airways Corporation 20.1
## 3 ExpressJet Airlines Inc. 15.8
## 4 Mesa Airlines Inc. 15.6
## 5 SkyWest Airlines Inc. 11.9
## 6 Envoy Air 10.8
## 7 Southwest Airlines Co. 9.65
## 8 JetBlue Airways 9.46
## 9 Endeavor Air Inc. 7.38
## 10 United Air Lines Inc. 3.56
## 11 US Airways Inc. 2.13
## 12 Virgin America 1.76
## 13 Delta Air Lines Inc. 1.64
## 14 American Airlines Inc. 0.364
## 15 Hawaiian Airlines Inc. -6.92
## 16 Alaska Airlines Inc. -9.93
airlines is not the only data set in nycflights13 that expands upon flights. nycflights13 contains a total of five data sets (flights, airlines, airports, planes, and weather) that each focus on a related aspect of air travel.
The diagram below lists the column names for each data set. You can see that each data set shares one or more common columns with flights. Let’s use one to answer a new query.
#### Which airports have the largest arrival delays?
Let’s repeat our last investigation to see which destinations have the largest average arrival delays. By swapping carrier with dest we arrive at:
flights %>%
drop_na(arr_delay) %>%
group_by(dest) %>%
summarise(avg_delay = mean(arr_delay)) %>%
arrange(desc(avg_delay))
## # A tibble: 104 × 2
## dest avg_delay
## <chr> <dbl>
## 1 CAE 41.8
## 2 TUL 33.7
## 3 OKC 30.6
## 4 JAC 28.1
## 5 TYS 24.1
## 6 MSN 20.2
## 7 RIC 20.1
## 8 CAK 19.7
## 9 DSM 19.0
## 10 GRR 18.2
## # … with 94 more rows
But we face a similar problem. How can we replace the dest codes with names?
Luckily, the airports data set shows the names associated with each code. But look closely at airports:
airports
## # A tibble: 1,458 × 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
## 8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
## 9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
## # … with 1,448 more rows
Which variable name does airports use for the airport codes?
This makes the two data sets difficult to join: flights and airports share a common variable, airport codes, but store it under different column names (dest and faa). This is a common occurrence with data. We can recreate the situation by making a second instrument data set that renames the first column to artist.
instrument2
## # A tibble: 3 × 2
## artist plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
We can still join band to instrument2, but we will need to tell R to match the name column to the artist column. To do this, you will need to know a little about how to name the elements of a vector.
[Diagram: a named vector]
A named vector is a vector whose elements have been given names. To create a named vector, simply assign names to each element of the vector when you create the vector with c().
For example, this creates an unnamed vector:
c(1, 2, 3)
## [1] 1 2 3
And this creates a named vector. Here the first element is named “uno”, the second is named “dos”, and so on.
c(uno = 1, dos = 2, tres = 3)
## uno dos tres
## 1 2 3
If you like, you can place quotes around the names when you make the vector, like c("uno" = 1, "dos" = 2, "tres" = 3). You’ll see me do that in the next section to make things look symmetric.
Named vectors are a basic feature of R. Let’s look at how we can use them to solve our join problem.
To match on columns with different names, change the by argument of your join function from a vector of character strings to a named vector of character strings.
band %>% left_join(instrument2, by = c("name" = "artist"))
## # A tibble: 3 × 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
R will match the column in the first data set that has the name (here “name”) with the column in the second data set that has the element (here “artist”).
[Diagram: matching the name column to the artist column]
To see what the result looks like in R, run the code below.
band %>% left_join(instrument2, by = c("name" = "artist"))
## # A tibble: 3 × 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
You can use this syntax to describe multiple pairs of columns. For example,
foo %>% left_join(foo2, by = c("first" = "artist1", "last" = "artist2"))
Technically, you do not need to surround the names of the vector with quotes. This would work.
foo %>% left_join(foo2, by = c(first = "artist1", last = "artist2"))
But you do need to use quotes around the elements of the vector, which are character strings. I like to use quotes on both sides of the = for parity.
Complete our code below to show the name of each destination paired with its average arrival delay.
flights %>%
drop_na(arr_delay) %>%
group_by(dest) %>%
summarise(avg_delay = mean(arr_delay)) %>%
arrange(desc(avg_delay))
flights %>%
drop_na(arr_delay) %>%
group_by(dest) %>%
summarise(avg_delay = mean(arr_delay)) %>%
arrange(desc(avg_delay)) %>%
left_join(airports, by = c("dest" = "faa")) %>%
select(name, avg_delay)
## # A tibble: 104 × 2
## name avg_delay
## <chr> <dbl>
## 1 Columbia Metropolitan 41.8
## 2 Tulsa Intl 33.7
## 3 Will Rogers World 30.6
## 4 Jackson Hole Airport 28.1
## 5 Mc Ghee Tyson 24.1
## 6 Dane Co Rgnl Truax Fld 20.2
## 7 Richmond Intl 20.1
## 8 Akron Canton Regional Airport 19.7
## 9 Des Moines Intl 19.0
## 10 Gerald R Ford Intl 18.2
## # … with 94 more rows
"Good Job! Flights from NYC to Columbia, South Carolina seem to have arrived particularly late in 2013. At the other end of the list, far off destinations in Alaska and Hawaii tended to arrive ahead of schedule."
The four join functions cover all of the ways you can combine information from one data set with another data set.
If you wish to combine more than two data sets, you can run the joins sequentially, first joining two data sets, then joining the result to a third, and so on. This process is easy to automate with the reduce() function in the purrr package.
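As a sketch under the assumption of three toy data frames (df1, df2, and df3 are hypothetical) that all share a name column:

```r
library(tidyverse)

df1 <- tibble(name = c("John", "Paul"), band  = "Beatles")
df2 <- tibble(name = c("John", "Paul"), plays = c("guitar", "bass"))
df3 <- tibble(name = c("John", "Paul"), born  = c(1940, 1942))

# reduce() applies left_join() pairwise: first df1 to df2,
# then that result to df3
list(df1, df2, df3) %>% reduce(left_join, by = "name")
```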
The next Topic will look at a group of joins that do something surprisingly different.
Let’s look more closely at the destinations of flights from New York City.
To do this we will use a new type of join: a filtering join. Filtering joins are different from mutating joins in that they do not add new data to a data set. Instead, they filter the rows of a data set based on whether or not the rows match rows in a second data set.
dplyr comes with two filtering join functions: semi_join() and anti_join().
Both follow the same syntax as the mutating joins.
semi_join() returns every row in the first data set that has a match in the second data set. So, for example, here semi_join() returns the John and Paul rows of band. Notice that semi_join() has not added anything to those rows.
To see what the results look like in R, run the code below.
band %>% semi_join(instrument, by = "name")
## # A tibble: 2 × 2
## name band
## <chr> <chr>
## 1 John Beatles
## 2 Paul Beatles
anti_join() does just the opposite of semi_join(); it returns all of the rows in the first data set that do not have a match in the second data set.
[Diagram: an anti join of band and instrument]
band %>% anti_join(instrument, by = "name")
## # A tibble: 1 × 2
## name band
## <chr> <chr>
## 1 Mick Stones
We will also use a new function that comes in dplyr: distinct(). distinct() isn’t a join function, but it is incredibly useful. distinct() returns the distinct values of a column.
instrument %>% distinct(plays)
## # A tibble: 2 × 1
## plays
## <chr>
## 1 guitar
## 2 bass
[Diagram: the distinct values of the plays column]
If you do not supply a column, distinct() returns the distinct rows of the data frame, removing duplicates.
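For example, with a toy tibble (not part of the tutorial data) that contains a duplicated row:

```r
library(tidyverse)

# A toy tibble in which the John/guitar row appears twice
tibble(name  = c("John", "John", "Paul"),
       plays = c("guitar", "guitar", "bass")) %>%
  distinct()
# Returns two rows: the duplicate John/guitar row is removed
```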
Now let’s put these three functions to work.
Use distinct() below to determine how many airports New York City connects to. This will be the number of distinct destinations in the flights data set. First create a data set with these destinations, then look for the number of rows in the data (it appears beneath the table in the results).
flights %>%
distinct(dest)
## # A tibble: 105 × 1
## dest
## <chr>
## 1 IAH
## 2 MIA
## 3 BQN
## 4 ATL
## 5 ORD
## 6 FLL
## 7 IAD
## 8 MCO
## 9 PBI
## 10 TPA
## # … with 95 more rows
Now let’s replace these codes with recognizable names. Add to the code below to left join our results to airports. Remember that the two data sets use different column names. Then select just the name column.
flights %>%
distinct(dest)
flights %>%
distinct(dest) %>%
left_join(airports, by = c("dest" = "faa")) %>%
select(name)
## # A tibble: 105 × 1
## name
## <chr>
## 1 George Bush Intercontinental
## 2 Miami Intl
## 3 <NA>
## 4 Hartsfield Jackson Atlanta Intl
## 5 Chicago Ohare Intl
## 6 Fort Lauderdale Hollywood Intl
## 7 Washington Dulles Intl
## 8 Orlando Intl
## 9 Palm Beach Intl
## 10 Tampa Intl
## # … with 95 more rows
Rolling back our results just a bit, you can see that some codes did not have a match in the airports data set.
flights %>%
distinct(dest) %>%
left_join(airports, by = c("dest" = "faa")) %>%
select(dest, name)
## # A tibble: 105 × 2
## dest name
## <chr> <chr>
## 1 IAH George Bush Intercontinental
## 2 MIA Miami Intl
## 3 BQN <NA>
## 4 ATL Hartsfield Jackson Atlanta Intl
## 5 ORD Chicago Ohare Intl
## 6 FLL Fort Lauderdale Hollywood Intl
## 7 IAD Washington Dulles Intl
## 8 MCO Orlando Intl
## 9 PBI Palm Beach Intl
## 10 TPA Tampa Intl
## # … with 95 more rows
This is unexpected. It would be useful to see which codes did not have a match. Extend the code below with a filtering join to return just the rows that do not have a match in airports.
flights %>%
distinct(dest)
flights %>%
distinct(dest) %>%
anti_join(airports, by = c("dest" = "faa"))
## # A tibble: 4 × 1
## dest
## <chr>
## 1 BQN
## 2 SJU
## 3 STT
## 4 PSE
anti_join() provides an easy way to double-check a join: it shows whether all of the rows that you expect to have a match actually do.
It’s not uncommon for anti_join() to reveal values with a misspelling or typo that prevents the join. Keep in mind that the typo could be in either data set.
Here, these appear to be real airport codes that are simply missing from airports. We cannot check the names of these four airports because, by definition, they are not in our data set of airport names.
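A handy property to remember when double-checking: for a given key, semi_join() and anti_join() split a data set into two complementary pieces, so their row counts always add up to the original. A minimal sketch with hypothetical data:

```r
library(dplyr)

# Hypothetical data: three orders, a lookup table that covers only two codes
orders <- tribble(
  ~id, ~code,
  1,   "AAA",
  2,   "BBB",
  3,   "CCC"
)
lookup <- tribble(
  ~code, ~label,
  "AAA", "Apples",
  "CCC", "Cherries"
)

matched   <- orders %>% semi_join(lookup, by = "code")  # rows with a match
unmatched <- orders %>% anti_join(lookup, by = "code")  # rows without one

nrow(matched) + nrow(unmatched) == nrow(orders)  # always TRUE
```

The flights results below follow the same pattern: 329,174 matched rows plus 7,602 unmatched rows account for every flight.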
Let’s gauge how this affects our data. Use the code chunk below to return all of the flights that do match an airport in airports. Be sure to use a filtering join, not a mutating join.
flights %>%
semi_join(airports, by = c("dest" = "faa"))
## # A tibble: 329,174 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 554 600 -6 812 837 -25 DL
## 5 2013 1 1 554 558 -4 740 728 12 UA
## 6 2013 1 1 555 600 -5 913 854 19 B6
## 7 2013 1 1 557 600 -3 709 723 -14 EV
## 8 2013 1 1 557 600 -3 838 846 -8 B6
## 9 2013 1 1 558 600 -2 753 745 8 AA
## 10 2013 1 1 558 600 -2 849 851 -2 B6
## # … with 329,164 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
For comparison, anti_join() returns the flights whose destinations are missing from airports:
flights %>%
anti_join(airports, by = c("dest" = "faa"))
## # A tibble: 7,602 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 544 545 -1 1004 1022 -18 B6
## 2 2013 1 1 615 615 0 1039 1100 -21 B6
## 3 2013 1 1 628 630 -2 1137 1140 -3 AA
## 4 2013 1 1 701 700 1 1123 1154 -31 UA
## 5 2013 1 1 711 715 -4 1151 1206 -15 B6
## 6 2013 1 1 820 820 0 1254 1310 -16 B6
## 7 2013 1 1 820 820 0 1249 1329 -40 DL
## 8 2013 1 1 840 845 -5 1311 1350 -39 AA
## 9 2013 1 1 909 810 59 1331 1315 16 AA
## 10 2013 1 1 913 918 -5 1346 1416 -30 UA
## # … with 7,592 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
How would you write a filter() statement that finds just the flights that left in January on JetBlue or in February on Southwest?
It can be done—as can many other complicated filters. But you may find it easier to perform complicated filters with semi_join() instead of filter().
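For the record, the filter() version looks like this; it works, but grows unwieldy as you add combinations. This sketch assumes the nycflights13 package, which supplies the flights data set used in this tutorial:

```r
library(dplyr)
library(nycflights13)  # supplies the flights data set

# January JetBlue (B6) flights, or February Southwest (WN) flights
jan_b6_feb_wn <- flights %>%
  filter((month == 1 & carrier == "B6") |
           (month == 2 & carrier == "WN"))

nrow(jan_b6_feb_wn)  # 5338, the same rows the semi_join() approach returns
```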
For example, you can create a data set that has the combinations you want:
criteria <- tribble(
~month, ~carrier,
1, "B6", # B6 = JetBlue
2, "WN" # WN = Southwest
)
criteria
## # A tibble: 2 × 2
## month carrier
## <dbl> <chr>
## 1 1 B6
## 2 2 WN
Then you can run a semi_join() against the data set. Use criteria and semi_join() below to return just the flights that left in January on JetBlue or in February on Southwest.
flights %>%
semi_join(criteria)
## Joining, by = c("month", "carrier")
## # A tibble: 5,338 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 544 545 -1 1004 1022 -18 B6
## 2 2013 1 1 555 600 -5 913 854 19 B6
## 3 2013 1 1 557 600 -3 838 846 -8 B6
## 4 2013 1 1 558 600 -2 849 851 -2 B6
## 5 2013 1 1 558 600 -2 853 856 -3 B6
## 6 2013 1 1 559 559 0 702 706 -4 B6
## 7 2013 1 1 600 600 0 851 858 -7 B6
## 8 2013 1 1 601 600 1 844 850 -6 B6
## 9 2013 1 1 613 610 3 925 921 4 B6
## 10 2013 1 1 615 615 0 1039 1100 -21 B6
## # … with 5,328 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
Filtering joins filter a data set against the observations in a second data set. They are called joins because they use information from both data sets. However, they use this information to filter—not augment—the original data.
distinct() is not a join at all, but it does filter data sets in a useful way.
The last topic in this tutorial will cover straightforward ways to combine data sets. These methods require your data sets to be pre-formatted to fit together, and they are fairly simple to understand.
Join functions specialize in data sets that relate to each other, but are not preformatted to fit together.
Sometimes, however, you may wish to paste together data sets that already “fit together”, as if they had been split off from a single master data set. The functions in this topic will show you how.
Consider the two data sets below. Notice that they contain different variables, but identical observations. For example, the first row in beatles1 aligns with the first row of beatles2, the second row aligns with the second row, and so on.
You wouldn’t need to do a join to combine these data sets; you’d just need to paste them together. How could you do it?
If your data sets contain the same observations, in the same order, you can combine them together with bind_cols().
Run the code below to see what the results look like in R.
beatles1 %>% bind_cols(beatles2)
## # A tibble: 4 × 4
## band name surname instrument
## <chr> <chr> <chr> <chr>
## 1 Beatles John Lennon guitar
## 2 Beatles Paul McCartney bass
## 3 Beatles George Harrison guitar
## 4 Beatles Ringo Starr drums
Note that this is a dangerous way to store your data, because it is hard to ensure that the rows of one data set haven’t gotten jumbled. bind_cols() cannot tell whether the rows are in the correct order or not, so you will need to be careful in these situations.
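To see the danger concretely, here is a hypothetical example in which one half of the data gets re-sorted before binding:

```r
library(dplyr)

# Two halves that fit together row by row
names_df       <- tibble(name = c("John", "Paul", "George", "Ringo"))
instruments_df <- tibble(instrument = c("guitar", "bass", "guitar", "drums"))

# Correct pairing, because the rows are in the same order
names_df %>% bind_cols(instruments_df)

# If one half gets re-sorted, bind_cols() silently mispairs the rows:
# Paul now appears to play guitar, and no error or warning is raised
shuffled <- instruments_df %>% arrange(desc(instrument))
names_df %>% bind_cols(shuffled)
```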
These data sets provide the opposite case, which is more common. Here each data set contains the same variables, but different observations. You could think of band2 as a continuation of band.
Use bind_rows() to combine data sets that contain the same variables, but different observations.
Run the code below to see what the results look like in R.
band %>% bind_rows(band2)
## # A tibble: 6 × 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
## 4 Ringo Beatles
## 5 Ronnie Stones
## 6 Mick Stones
When combining data with bind_rows(), it can be useful to add a new column that shows where each row came from.
image
The easiest way to do this is to save the input data sets as a named list and call bind_rows on the list. Then add the argument .id to your bind_rows() call and set .id to a character string. bind_rows() will use the character string as the name of a new column that displays the name of the data set that each row comes from (as determined by the names in the list).
If you’d like to refresh your understanding of lists in R, revisit the Programming Basics tutorial.
Add a .id argument to the code below to create the output displayed in the diagram.
bands <- list(df1 = band,
df2 = band2)
bands %>% bind_rows()
bands <- list(df1 = band,
df2 = band2)
bands %>% bind_rows(.id = "origin")
## # A tibble: 6 × 3
## origin name band
## <chr> <chr> <chr>
## 1 df1 Mick Stones
## 2 df1 John Beatles
## 3 df1 Paul Beatles
## 4 df2 Ringo Beatles
## 5 df2 Ronnie Stones
## 6 df2 Mick Stones
Good job! You can add more than two data sets to your list if you wish to bind together multiple data sets at once.
Did you notice that band and band2 contain a duplicate row? Each contains a row for Mick.
When your data sets contain the same variables and overlapping sets of observations, you can use traditional set operations to return a reduced set of rows drawn from the data sets.
Imagine what each of the set operations below will return when applied to the data sets above. Then run the code to check if you are right.
band %>% union(band2)
## # A tibble: 5 × 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
## 4 Ringo Beatles
## 5 Ronnie Stones
band %>% intersect(band2)
## # A tibble: 1 × 2
## name band
## <chr> <chr>
## 1 Mick Stones
band %>% setdiff(band2)
## # A tibble: 2 × 2
## name band
## <chr> <chr>
## 1 John Beatles
## 2 Paul Beatles
band2 %>% setdiff(band)
## # A tibble: 2 × 2
## name band
## <chr> <chr>
## 1 Ringo Beatles
## 2 Ronnie Stones
union() returns every row that appears in either data set, but it removes duplicate copies of the rows.
band %>% union(band2)
intersect() returns only the rows that appear in both data sets. It too removes duplicate copies of these rows.
band %>% intersect(band2)
setdiff() returns all of the rows that appear in the first data set but not the second.
band %>% setdiff(band2)
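One point worth internalizing: unlike bind_rows(), all three set operations deduplicate. A sketch using tibbles reconstructed from the outputs shown in this section (they may differ from the tutorial's exact objects):

```r
library(dplyr)

band  <- tribble(~name,    ~band,
                 "Mick",   "Stones",
                 "John",   "Beatles",
                 "Paul",   "Beatles")
band2 <- tribble(~name,    ~band,
                 "Ringo",  "Beatles",
                 "Ronnie", "Stones",
                 "Mick",   "Stones")

nrow(bind_rows(band, band2))  # 6: the duplicate Mick row is kept
nrow(union(band, band2))      # 5: the duplicate Mick row appears once
```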
Master a core programming paradigm with the purrr package: for each ____ do ____.
Iteration is the task of applying a function to each element of a vector in turn. This tutorial will explain what a vector is (it might not be what you think!) and introduce three ways to do iteration in R: for loops, the lapply family of functions, and the purrr package.
purrr’s family of map functions makes iteration quick and easy. Here you will learn the ins and outs of map() and its variants.
Here you will learn map()’s built-in shortcuts for the most common map tasks, as well as an expression language that lets you map more than functions.
Now that you know how to iterate over a single vector, it is time to learn how to iterate over two or more vectors at once, or even a vector of functions.
Here you will use your iteration skills to overhaul your entire data analysis workflow (if you want). Iteration facilitates a very useful new way to organize the products of data science.
Functions are the key to programming in R. This primer will teach you how to write and use your own reusable functions.
Start here to learn what a function really is. This quick tutorial explains the structure of functions and how to call them.
Here it is! The best practice workflow for writing your own functions in R. You’ll also learn some shortcuts for converting common types of code into R functions.
Arguments are the user interface (UI) to your functions. This extended quiz will teach you the ins and outs of writing, and using, a good argument UI.
Environments and Scoping Rules: Here you will learn how R stores and looks up objects.
Control flow refers to the order in which a function executes its code. Here you’ll learn how to run specific code in specific cases with if and else, and how to stop function execution early with return() and stop().
Next it is time to learn how to combine logical tests in if statements, as well as how to write if statements that work with vectors. This is a prerequisite for using if in vectorized functions.
Learn to repeat code with R’s repeat, while, and for loops. You’ll also learn to recognize when you should or shouldn’t use loops in R.
Learn to report, reproduce, and parameterize your work with the best authoring format for Data Science: R Markdown.
Say hello to Shiny, R’s package for building interactive web apps. Learn to turn your analyses into elegant tools to share with others.
Become an R guru by mastering all of the tools built into the RStudio IDE. Discover best practices for programming, debugging, version control, package building and more.